PREFACE


The original mimeographed edition (1938) of Lectures and Conferences on 
Mathematical Statistics was exhausted within two years of its publication. 
This, together with the subsequent continued inquiries from various persons 
and institutions, suggested that broad circles of statisticians are in need of 
a book such as this which gives the general ideas behind the theory of sta- 
tistics and behind its applications. Unfortunately, certain circumstances 
prevented an earlier reissue of the book. 

The present edition differs substantially from the first by an omission, 
by several additions and by reformulation of a considerable part of the 
earlier material. Owing to the extraordinary development of the econo- 
metric school on the one hand and of the works on stochastic processes 
on the other, the relevant Conference in the first edition became out of 
date and was omitted entirely. The interested reader is referred to arti- 
cles in Econometrica, particularly to those of Ragnar Frisch, T. J. Koop- 
mans, Oscar Lange and J. Marschak. In addition, he will find it both 
interesting and instructive to study the articles of J. L. Doob and W. Feller 
recently published in the Proceedings of the Berkeley Symposium on
Mathematical Statistics and Probability.¹

Sporadic additions to the original material are inserted throughout 
the book. However, there are a few sections which deserve special mention. 
One such section is concerned with sampling human populations. Specifi- 
cally, Parts 1 and 2 of Chapter III include a systematic presentation of the 
theory. Part 2 reproduces an article published some time ago in the Journal 
of the American Statistical Association and it is a pleasure to record my 
indebtedness to the Editor for the kind permission to do so. 

The next substantial addition is Part 3 of Chapter III, which deals with 
spurious methods of studying correlation. Although the subject is not novel, 
the inclusion of a special section given to it seems justified by the fact that 
it appears to have been neglected by other authors while many empirical 
studies continue to involve errors of the kind described. 

Although the earlier edition of Lectures and Conferences contains a 
counterpart of the present Chapter IV, there is a very substantial difference 
in presentation and a considerable addition of material. This chapter gives 


1 University of California Press, Berkeley and Los Angeles, 1949, 501 pp. 


a three-cornered discussion of the ideas of estimation, from the point of view 
of Bayes’ formula, from the point of view of confidence intervals and from 
the point of view of fiducial argument. Since the publication of the first 
edition of Lectures and Conferences, there has occurred a certain shift in 
“allegiances” exemplified by the fact that a large section devoted to fiducial 
distributions, present in an early edition of a book by an eminent author, 
does not appear in his subsequent books, which contain, instead, sections on 
confidence intervals. However, indications of the confusion of the Bayes’ 
and the more modern treatment of the problem are still noticeable in certain 
sections of the literature and misconceptions involved in the fiducial argu- 
ment appear about as frequently. For this reason it seemed advisable to 
subject the matter to a detailed discussion. Here I wish to record my hearty 
thanks to Professor E. S. Pearson, the Editor of Biometrika, for his kind 
permission to reproduce my article, originally published in that journal. 

Part 4 of Chapter IV is entirely new and is given over to the brilliant 
recent result of Charles M. Stein. 

Before concluding, I take pleasure in expressing my hearty thanks to 
Dr. Evelyn Fix for her invaluable help in the preparation of this book, 
for preparing the numerical illustrations, for reading and correcting the 
manuscript, and for kindly advice and suggestions. 

J. NEYMAN 
March, 1952 
CHAPTER I


The Modern Viewpoint on the Classical Theory of Probability 
and Its Applications. Tests of Statistical Hypotheses 


(The contents of this chapter are based on three lectures delivered at the Graduate 
School of the United States Department of Agriculture in April, 1937.) 


Introduction 


After the original titles of my lectures had been fixed, I received a number 
of letters from members of the prospective audience and these letters forced 
me to modify the original programme and to place more emphasis than 
I had intended on concepts basic in the theory of probability and statistics. 

The concept of probability has been discussed and defined in many dif- 
ferent ways, each having its own advantage. It must be emphasized that, 
although the respective theories frequently contradict each other, this does 
not necessarily mean that some of them are wrong. Any theory is correct 
as long as the axioms on which it is based are not mutually contradictory 
and as long as there are no errors in deductions. Among the existing 
systems of axioms and theories deducible from them, we must make a choice. 
In this we shall be guided by considerations of usefulness or, by what fre- 
quently amounts to the same thing, our personal taste. It is important, 
however, to make clear the theory in which one is working. Otherwise, 
unnecessary misunderstandings may arise. 

In my first lecture I shall describe the basic ideas of the theory of proba- 
bility that I prefer and have had in mind when working on the theories of 
testing statistical hypotheses and of estimation. 

So far as I am aware these views of mine are shared by E. S. Pearson 
and other workers attached to the Department of Statistics at University 
College, London. It may be, therefore, that the present lectures will help 
one to understand the whole of the work carried on in that centre. 

It would be useless, of course, to try to develop the entire theory of prob- 
ability in only two or three lectures. Therefore I shall concentrate on the 
general ideas, definitions, etc. Details of the theory of probability treated 
from the same point of view, though perhaps using different wordings, may 
be found in various books and papers, of which I shall mention the following: 
1. H. Cramér: Random variables and probability distributions. Cambridge, 1937.

2. M. Fréchet: Recherches théoriques modernes sur la théorie des probabilités. 
Gauthier-Villars, Paris, 1937. 

3. A. Kolmogoroff: Grundbegriffe der Wahrscheinlichkeitsrechnung. Julius Springer, 
Berlin, 1933.


Finally, an elementary systematic presentation is given in the recent book: 


J. Neyman: First course in probability and statistics. Henry Holt and Co., New 
York, 1950. 


The second lecture will be given entirely to the question of the possibility 
of applying the mathematical theory of probability to practical problems. 
The ideas developed here have grown out of reading such writers as E. Borel, 
L. v. Bortkiewicz, Karl Pearson and undoubtedly others but it is difficult 
to give exact references. 

In the third and last lecture I shall deal with the somewhat narrower 
but still rather broad question of what is the meaning of a test of a statistical 
hypothesis and what are the grounds for choosing between several alternative 
tests. Material for the third lecture has been taken essentially from an 
article of mine which was published in 1929 in the Proceedings of the First 
Congress of Slavonic Mathematicians in Warsaw. The title of the article 
is “Méthodes nouvelles de vérification des hypothèses statistiques.”


Part 1. On the Theory of Probability 


1. DEFINITION OF PROBABILITY. Probability as I shall define it will always 
refer to an object of a specified kind, say A, having a certain property, 
say B. Thus we may speak of the probability of a ball having the property 
of being black, of a person 36 years of age “having the property” of dying 
during the next twelve months, etc. It has been usual to define probability 
referring either to events or to propositions. Obviously the choice is very 
much a matter of convenience and it seems to me that speaking of the 
probabilities of objects having certain properties is convenient. Besides, it 
will be noticed that in this nomenclature we may speak also of probabilities 
of events. We will mean the probabilities of events having the property 
of actually occurring. Also it will be possible to speak of probabilities of 
propositions, which will mean the probabilities of propositions having the 
property of being true. The assumed system of expressions seems, therefore, 
to be not less general than the others. 

In a mathematical definition, the actual wording used does not matter 
very much. However, it does have some importance since different wordings 
may appeal to intuition with different strengths and may give different 
emphases to the essential source of the concepts introduced. The essential 
point in the concept of probability which I will use is that it will always
refer to a specified set of objects, which I shall describe as the fundamental 
probability set. This point is emphasized in the wording adopted, since 
we agree to speak of the probability of a specified object A having a property 
B. It will be noticed that the process of specifying the object A is equivalent 
to specifying or perhaps even enumerating all objects that are “A” in 
distinction from those that are not. Now, all objects A will form what 
I shall call the fundamental probability set (F.P.S. for short). This will 
also be denoted by (A).¹

It is obvious that in order to be able to enumerate all objects A, these 
objects must be well defined by a specification of one or more properties 
distinguishing the objects A from all other objects. This property will also 
be denoted by the same letter A. 

Before proceeding further I shall explain the terms logical sum and 
logical product of two or more properties. Let $B_1$ and $B_2$ be any two
properties. The property $B_3$ is a logical sum (or sum for short) of $B_1$ and
$B_2$, if it consists in an object possessing at least one of the properties $B_1$
and $B_2$, and for this sum we shall write $B_3 = B_1 + B_2$. It will be convenient
to use an expression like “an object $B_1 + B_2$” to denote an object possessing
the property $B_1 + B_2$, etc.

A property $B_4$ will be called a logical product (or product for short) of
the properties $B_1$ and $B_2$ if it consists in an object possessing both $B_1$ and
$B_2$. We shall use the notation $B_4 = B_1 B_2$ for this property and use the
expression “an object $B_1 B_2$” to denote an object possessing the property
$B_1 B_2$.

The above definitions are immediately extended to the sum and product 
of any number of properties, finite or infinite. 

Turning now to the definition of probability of an object A possessing 
the property B, I want to emphasize that it requires the enumeration of all 
the objects A actually possessing the property B, i.e. all the objects possess- 
ing the property AB. According to the conventions already established, the 
set of those will be denoted by (AB). 

Up to the present time our considerations have been perfectly general. 
Owing to the fact that the mathematical theory of sets is not commonly 
known, further steps leading to the definition of probability will have to be 
discussed twice, once on the assumption that the fundamental probability 
set (A) is finite and next, that it is anything, finite or infinite. 

Suppose that the fundamental probability set (A) is finite, and denote 
by n the number of objects it contains. Further, let k be the number of 


1 “(x)” stands for “all x” and analogously for any letter in parentheses. This
notation is in common use.
objects belonging to (A) and having the property B. The probability of
an object A having the property B will be defined as the ratio k/n, and 
will be denoted by 

$$P\{B \mid A\} = \frac{k}{n}. \tag{1}$$


In other words, the probability of an object A having the property B is 
defined as the proportion of objects A having the property B. The expres- 
sion “the probability of an object A having a property B” is, of course, 
somewhat lengthy; we shall therefore use abbreviations such as “the prob- 
ability of B,” but it is necessary to remember the full meaning of these words. 
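
The definition in equation (1) can be tried out on a small finite set. The Python sketch below is an illustration only: the function name `probability`, the urn of ten balls and the six faces of a die are assumptions chosen for the example, not objects taken from the text. It counts the k objects having the property among the n objects of the fundamental probability set and returns the exact ratio k/n.

```python
from fractions import Fraction


def probability(fps, has_property):
    """Return k/n: the proportion of objects in `fps` having the property.

    `fps` plays the role of the fundamental probability set (A) and
    `has_property` is a predicate standing for the property B.
    """
    fps = list(fps)
    if not fps:
        # With an empty set the ratio k/n has no meaning.
        raise ValueError("the fundamental probability set must not be empty")
    k = sum(1 for obj in fps if has_property(obj))
    return Fraction(k, len(fps))


# Hypothetical F.P.S.: an urn holding 3 black and 7 white balls.
urn = ["black"] * 3 + ["white"] * 7
p_black = probability(urn, lambda ball: ball == "black")

# Hypothetical F.P.S.: the six faces of a die, labelled by their points.
faces = [1, 2, 3, 4, 5, 6]
p_six_face = probability(faces, lambda face: face == 6)

print(p_black)     # 3/10
print(p_six_face)  # 1/6
```

Exact fractions are used instead of floating-point numbers so that the result is literally the ratio of two counts, as in the definition.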

Whenever there will be no danger of misunderstanding, the above notation 
can be simplified. For instance, if the probabilities that are calculated in 
the course of solving a certain problem refer always to the same funda- 
mental probability set (A), the letter A may be omitted in the symbol of 
probability, whereupon P{B} will suffice for P{B | A}. Sometimes, how- 
ever, we shall have to deal not only with a fundamental probability set (A), 
but also with one or more others, each forming a part of (A). For instance, 
besides dealing with the probability of an object A having a certain property 
B’, we might deal also with the probability of an object AB having the 
same property B’ (or some other). In such cases the probabilities referring 
to objects A may be written without specifying their set, while probabilities 
referring to objects AB may not be: thus, P{B’ | AB} may be shortened 
to P{B’ | B}, and P{B’ | A} may be shortened to P{B’}. 

It is most important to distinguish the probabilities P{B’| A} and 
P{B’| AB}. The former is the proportion of all objects A having the prop- 
erty B’, while the latter is the proportion of objects AB having the property 
B’ in addition to the property AB. Special care in distinguishing these two 
concepts is needed when we use shorter expressions and notations. 

In order to emphasize this distinction we shall sometimes describe 
P{B’ | A} as the absolute probability of B’ and P{B’ | AB} as the relative 
probability of B’ given B. The relative probability of B’ given B may or 
may not be equal to the absolute probability of B’. If it is, then we say 
that the property B’ is independent of B. 
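
The distinction between absolute and relative probability can be checked by direct counting. In the sketch below, the set of integers 1 to 12 and the three properties are hypothetical choices made only for illustration. The relative probability is obtained by applying the same counting rule to the narrower set (AB).

```python
from fractions import Fraction


def probability(fps, has_property):
    """Proportion of the objects in `fps` that have the given property."""
    fps = list(fps)
    if not fps:
        raise ValueError("the fundamental probability set must not be empty")
    return Fraction(sum(1 for obj in fps if has_property(obj)), len(fps))


def is_even(x):
    """Property B."""
    return x % 2 == 0


def is_multiple_of_3(x):
    """A property B' that turns out to be independent of B on this set."""
    return x % 3 == 0


def is_multiple_of_4(x):
    """A property B' that is not independent of B on this set."""
    return x % 4 == 0


A = list(range(1, 13))             # hypothetical F.P.S. (A)
AB = [x for x in A if is_even(x)]  # the set (AB) of objects A having B

for b_prime in (is_multiple_of_3, is_multiple_of_4):
    absolute = probability(A, b_prime)   # P{B' | A}
    relative = probability(AB, b_prime)  # P{B' | AB}
    print(b_prime.__name__, absolute, relative, absolute == relative)
```

On this particular set, being a multiple of 3 is independent of being even (both proportions equal 1/3), while being a multiple of 4 is not (1/4 against 1/2).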

It will be noticed that the definition of probability applies only to cases 
where the fundamental probability set is not empty, that is to say, only 
when it contains at least one element. Otherwise the word probability 
would have no meaning. It follows that whenever we speak of a probability, 
we imply that the fundamental probability set is not empty. 

It follows from the definition that the probability P of any property, E, 
is a fraction between zero and unity. If P = 0, none of the elements of 
the F.P.S. has the property E. In this case we can conveniently describe
E as an impossible property. If on the other hand P = 1, every element of
the F.P.S. has the property E, which may then be described as a sure
property.² It is easily seen that the converses are true, namely that if
$E_1$ and $E_2$ are an impossible and a sure property respectively, then
$P\{E_1\} = 0$ and $P\{E_2\} = 1$. It will be
noticed that the relative probability P{B’ | B} of B’ given B has a definite 
meaning only if B is not an impossible property. 

The characteristic feature of the above definition of probability is (i) 
that it refers to sets of objects and (ii) that it does not involve any reference
to “equally probable” cases. In order to emphasize the consequences of 
the definition, I shall discuss a few examples. 

Example 1.—A die has six faces, one and only one of which has six points 
on it. The probability of a side of the die having six points on it will be, 
according to our definition, always 1/6. No experiments with die throwing 
are able to alter this conclusion. 

Example 2.—The probability of a side of the die having six points on it 
must be distinguished from the probability of getting six points on the 
die when the die is thrown. 

Reading this last sentence once again and comparing it with the definition 
of probability, equation (1), one will easily see that, without further descrip- 
tion of the situation, the definition of probability could not be applied to 
the throws. In speaking of “the probability of getting six points on the 
upper side of a die when throwing” and in trying to apply the definition 
of probability, we may have various things in mind. 

(a) We may think of a set of 100 throws already carried out. Then 
there will be no difficulty in calculating the probability required. 

(b) We may think of a set of some 100 future throws. In that case the 
probability required, say P{six}, will be just unknown. To establish its 
value, we should carry out the throws and count the cases with “six.” 

(c) Finally we may have in mind some hypothetical series of throws and 
discuss various probabilities referring to it. Usually such discussions con- 
sist in deducing values of one or more probabilities from the assumed hypo- 
thetical values of others. Some examples of such discussions will be found 
later. 

Of the three ways of interpreting the ambiguously stated problem con- 
cerning the probability of getting “six” on a die when throwing, the last is 
the most fruitful. We shall see this a little further on when I shall speak
of the so-called empirical law of large numbers. 

Example 3.—Consider the familiar expansion $\pi = 3.14159\ldots$ and denote
by $x_{1000}$ its thousandth decimal. What is the probability $P\{x_{1000} = 5\}$ of
its being equal to 5? Here the question is not ambiguous and the answer


2 “Sure property” is an English adaptation of the French phrase, “propriété certaine,”
as introduced by Maurice Fréchet and used in similar contexts. 
is immediately found: the value of the probability $P\{x_{1000} = 5\}$ is actually
unknown, but it is certainly either zero or unity. In fact, there is but
one object satisfying the definition of $x_{1000}$. Therefore, the fundamental
probability set consists of only one element and thus the denominator
in the right hand side of equation (1) is equal to unity. The numerator
may be equal to unity, if $x_{1000}$ is actually equal to 5, or to zero, if
$x_{1000}$ is not equal to 5. As the decimals in the expansion of $\pi$ are known
only to 707 places, $x_{1000}$ is unknown and therefore we do not know whether
$P\{x_{1000} = 5\}$ is zero or unity.

As I have mentioned before, probabilities may refer to some hypothetical 
probability sets, with assumed properties. This case is the one with which 
the theory is most often concerned, and is of extreme importance. There- 
fore I shall give two illustrations. 

Example 4.—Consider a set $F_1$ of n die tosses, and denote by $F_2$ the set
of $\tfrac{1}{2}n(n - 1)$ different pairs that may be formed out of them, no element
to be repeated in a pair. If certain properties of the set $F_1$ are given we
may calculate the probability, say $P\{\text{six, six} \mid F_2\}$, of a pair of throws with
two “sixes,” referring it to $F_2$ as the F.P.S. The property of $F_1$ that is
needed for the calculation of $P\{\text{six, six} \mid F_2\}$ consists in the probability
$P\{\text{six} \mid F_1\}$ of getting a six in one throw. Assume, for instance, that


$$P\{\text{six} \mid F_1\} = \frac{1}{6}. \tag{2}$$


This would mean that among the n throws in $F_1$ there are exactly n/6
with six on the top face of the die, from which we could conclude that
among the $\tfrac{1}{2}n(n - 1)$ pairs of throws forming $F_2$ there are exactly


$$\frac{1}{2} \cdot \frac{n}{6}\left(\frac{n}{6} - 1\right) = \frac{n(n - 6)}{72} \tag{3}$$


such pairs that consist of two “sixes,” and therefore that the probability


$$P\{\text{six, six} \mid F_2\} = \frac{n - 6}{36(n - 1)}. \tag{4}$$

It will be seen that the above result is purely hypothetical: if the
connection between $F_1$ and $F_2$ is as described above, and if the probability of
a specified property (“six”) calculated with regard to $F_1$ is 1/6, then the
probability $P\{\text{six, six} \mid F_2\} = \frac{n - 6}{36(n - 1)}$. Thus, if the probability
set $F_2$ has the properties as specified in the conditions of the problem, then
formula (4) holds good. We may notice at this stage that the properties
of a probability set $F_2$ relevant for the calculation of probabilities may be
given indirectly by specifying certain properties of some other set $F_1$ (or
of many other such sets), and by describing the connection between $F_1$ and
$F_2$. A similar situation prevails in the following example.

Example 5.—Consider a series of n hypothetical experiments and assume
that each of these experiments results either in an event E or in a failure
to produce E, described as non-E. Assume further that a separate
probability set is connected with each of the experiments, each set consisting of
the same number m of elements and denote by $F_i'$ the set corresponding to
the ith experiment, $i = 1, 2, \ldots, n$. Suppose that whatever be i, the
probability of the event E calculated with regard to $F_i'$ is the same, that is,


$$P\{E \mid F_i'\} = p. \tag{5}$$


We may now consider still another probability set, say $F_0$, the elements
of $F_0$ being all possible combinations of elements of the sets $F_1', F_2', \ldots, F_n'$
taken n at a time, where each element in the combination is selected from
a different set. If each of the sets $F_1', F_2', \ldots, F_n'$ consists of the same
number m of elements, then the set $F_0$ will consist of $m^n$ elements.

The assumed properties of the sets $F_1', F_2', \ldots, F_n'$ and their connection
with $F_0$ permit the calculation of various probabilities referring to $F_0$. For
instance we may calculate the probability, say $P_{n,k}$, which frequently is
picturesquely described as the probability of getting an event E exactly
k times in the course of n independent trials, the probability of E in each
trial being permanently equal to p. This probability is easy to calculate
and is known to be equal to


$$P_{n,k} = \frac{n!}{k!\,(n - k)!}\, p^k (1 - p)^{n - k}. \tag{6}$$

But it is important to know what this formula denotes. This probability
$P_{n,k}$ is no more and no less than the proportion of elements of the set $F_0$ that
have the desired property of k “events” E and $n - k$ “events” non-E.
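
The reading of the binomial probability as a proportion of elements can be verified by brute-force enumeration. The sketch below is only an illustration under assumed values (n = 4 experiments, m = 6 elements per first-order set, one element with the property E in each, k = 2); the function name is hypothetical. It builds every combination of one element from each first-order set, counts those containing exactly k elements E, and compares the proportion with the binomial expression n!/(k!(n - k)!) p^k (1 - p)^(n - k).

```python
from fractions import Fraction
from itertools import product
from math import comb


def proportion_with_k_events(n, m, with_e, k):
    """Enumerate the higher-order set and count elements with exactly k E's.

    Each of the n first-order sets has m elements, `with_e` of which have
    the property E. The higher-order set has m**n elements.
    """
    first_order = ["E"] * with_e + ["non-E"] * (m - with_e)
    total = 0
    favourable = 0
    for element in product(first_order, repeat=n):
        total += 1
        if element.count("E") == k:
            favourable += 1
    return Fraction(favourable, total)


# Assumed illustrative values, not taken from the text.
n, m, with_e, k = 4, 6, 1, 2
p = Fraction(with_e, m)

by_counting = proportion_with_k_events(n, m, with_e, k)
by_formula = comb(n, k) * p**k * (1 - p) ** (n - k)

print(by_counting, by_formula)  # 25/216 25/216
```

With these values the higher-order set has 6^4 = 1296 elements, of which 150 contain exactly two elements E, so both computations give 150/1296 = 25/216.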

Again in this example, the calculation of the probability $P_{n,k}$ referring to
the probability set $F_0$ was based on probabilities referring to the sets
$F_1', F_2', \ldots, F_n'$ and on the structure of elements of $F_0$, each of them being
composed of elements of $F_1', F_2', \ldots, F_n'$.

This is a typical situation and it will be convenient to introduce special 
terminology for its description. If the elements of any probability set $F_0$
are combinations of those of some other sets $F_1$, $F_2$, etc., then we shall say
that the set $F_0$ is of a higher order than the sets $F_1, F_2, \ldots$. Thus we
may distinguish probability sets of first, second, third, etc. order.

In Example 4 the set $F_1$ is of first, and the set $F_2$ of second order. In
Example 5 the sets $F_1', F_2', \ldots, F_n'$ are of first order and the set $F_0$ of the
second. It is easy to construct examples in which there will be probability
sets of three or more successive orders.
In what I have just said I used the expressions “experiments,” “results,”
“events,” which were not directly involved in the definition of probability.
I want to emphasize that these expressions are no more than a picturesque 
description of fundamental probability sets and that if purity of language 
really were demanded, they should not be used. However, these and 
similar expressions are very frequent in all works on probability. They 
were established in olden days when the point of view regarding prob- 
ability theory was somewhat different. We hold on to them now because 
of their convenience. This point will be discussed later when I shall speak 
of applications and of the law of large numbers. 

We shall notice now that a description of a conceptual experiment, as in 
the above examples, amounts really to a description of probability sets. 
As the sets were classified, so will be classified the corresponding hypo- 
thetical experiments. Therefore we shall speak of experiments of the first, 
second, third, ... order.

In order to clear away any possible misunderstanding, let us consider 
again the probability sets involved in the last two examples, and illustrate 
them graphically. The set $F_1$ of Example 4 may be represented by the
use of the letter s for “six,” and the letter r for “not-six.” With n = 12,
we might have the following picture:

   r   r   r   r   r   r   r   r   r   r   s   s
 ------------------------------------------------
   1   2   3   4   5   6   7   8   9  10  11  12

The numbers 1 to 12 below the line represent the ordinal numbers of the
elements of $F_1$.

To represent $F_2$ diagrammatically it will be convenient to use two
dimensions. Each element of $F_2$ is represented by rr, rs, sr, or ss. The rectangular
coordinates x and y of an element of $F_2$ are equal to the ordinal numbers
of the two elements of $F_1$ making up this element of $F_2$. As x can never
be equal to y, i.e., no element of $F_1$ is to be repeated, it is permissible to
take x > y. There will be only one element of $F_2$ possessing the property
“six-six” (ss), that composed of the eleventh and twelfth elements of $F_1$.
It may be seen from Figure 1 that the number of elements forming $F_2$ is
66 and that, therefore, $P\{\text{six, six} \mid F_2\} = 1/66$, which agrees perfectly with
formula (4) above, if n therein be set equal to 12.
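
The count behind this result can be reproduced mechanically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the sixes are placed at the end of the first-order set as in the case n = 12 just described, and n is taken to be a multiple of 6 so that exactly n/6 throws show a six. It forms all unordered pairs of distinct throws, counts the pairs of two sixes, and compares the proportion with the expression (n - 6)/(36(n - 1)).

```python
from fractions import Fraction
from itertools import combinations


def prob_six_six_by_counting(n):
    """Proportion of pairs of distinct throws that consist of two sixes."""
    if n < 6 or n % 6 != 0:
        raise ValueError("n must be a positive multiple of 6")
    sixes = n // 6
    # First-order set: 'r' for not-six, 's' for six (sixes placed last).
    throws = ["r"] * (n - sixes) + ["s"] * sixes
    # Second-order set: all pairs of distinct throws, order disregarded.
    pairs = list(combinations(range(n), 2))
    k = sum(1 for i, j in pairs if throws[i] == "s" and throws[j] == "s")
    return Fraction(k, len(pairs))


def prob_six_six_by_formula(n):
    """The closed expression (n - 6) / (36 (n - 1))."""
    return Fraction(n - 6, 36 * (n - 1))


for n in (6, 12, 18, 60):
    print(n, prob_six_six_by_counting(n), prob_six_six_by_formula(n))
```

For n = 12 both functions return 1/66, the single pair (11, 12) out of 66 pairs.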

FIGURE 1

 y
11 |                                              ss
10 |                                          sr  sr
 9 |                                      rr  sr  sr
 8 |                                  rr  rr  sr  sr
 7 |                              rr  rr  rr  sr  sr
 6 |                          rr  rr  rr  rr  sr  sr
 5 |                      rr  rr  rr  rr  rr  sr  sr
 4 |                  rr  rr  rr  rr  rr  rr  sr  sr
 3 |              rr  rr  rr  rr  rr  rr  rr  sr  sr
 2 |          rr  rr  rr  rr  rr  rr  rr  rr  sr  sr
 1 |      rr  rr  rr  rr  rr  rr  rr  rr  rr  sr  sr
   +------------------------------------------------
       1   2   3   4   5   6   7   8   9  10  11  12   x


We may now illustrate the connection between the probability sets $F_0$
and $F_1', F_2', \ldots, F_n'$ of Example 5. Let us put k = n = 2, m = 6, p = 1/6,
so that among the six elements forming either $F_1'$ or $F_2'$ there will be only
one possessing the property E, the other five, denoted by G, being non-E.
Let E in both sets be the 6th element. Any element of $F_0$ is formed by
combining an element of $F_1'$ with some element of $F_2'$. Therefore, it will
be convenient to represent each element of $F_0$ by a point on a plane whose
coordinates x and y are equal to the ordinal numbers of the elements of
$F_1'$ and $F_2'$, the combination of which produces the element of $F_0$ under
consideration (see Figure 2). All the elements of $F_0$ possess the required
property of being composed of elements of $F_1'$ and $F_2'$, but only one of
the 36 is EE. The resulting probability $P_{2,2} = 1/36$ is in agreement with
the binomial formula (6).


FIGURE 2

 y
 6 |  GE  GE  GE  GE  GE  EE
 5 |  GG  GG  GG  GG  GG  EG
 4 |  GG  GG  GG  GG  GG  EG
 3 |  GG  GG  GG  GG  GG  EG
 2 |  GG  GG  GG  GG  GG  EG
 1 |  GG  GG  GG  GG  GG  EG
   +------------------------
       1   2   3   4   5   6   x


I hope that it is not necessary to insist that the above results, namely, 


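$$P\{EE \mid F_2\} = P\{\text{six, six} \mid F_2\} = \frac{1}{66} \quad \text{(Ex. 4)} \tag{7}$$
and
$$P\{EE \mid F_0\} = P\{\text{six, six} \mid F_0\} = \frac{1}{36} \quad \text{(Ex. 5)} \tag{8}$$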


do not represent any sort of paradox. Both probabilities are calculated 
correctly and they differ only because they refer to different probability 
sets, $F_2$ and $F_0$. This emphasizes the fact that probabilities refer to
probability sets and that failure to specify the probability set properly may,
and usually does, cause misunderstanding. 

Example 6.—The inclusion of the present example is occasioned by certain
statements of Harold Jeffreys³ which suggest that, in spite of my
insistence on the phrase, “probability that an object A will possess the 
property B,” and in spite of the five foregoing examples, the definition of 
probability given above may be misunderstood. 

Jeffreys is an important proponent of the subjective theory of probability 
designed to measure the “degree of reasonable belief.” His ideas on the 
subject are quite radical. He claims⁴ that no consistent theory of probability
is possible without the basic notion of degrees of reasonable belief.
His further contention is that proponents of theories of probabilities
alternative to his own forget their definitions “before the ink is dry.”⁵ In
Jeffreys’ opinion, they use the notion of reasonable belief without ever 
noticing that they are using it and, by so doing, contradict the principles 
which they have laid down at the outset. 

The necessity of any given axiom in a mathematical theory is something 
which is subject to proof. For example, it was possible to prove that many 
of the theorems taught for decades in calculus depend on the famous axiom 
of Zermelo which by itself seems very doubtful to many mathematicians. 
The method of proof is as follows: One assumes that a given theorem is 
true and then deduces that the axiom subject to doubt must be true also. 

However, Dr. Jeffreys’ contention that the notion of degrees of reasonable 
belief and his Axiom 1⁶ are necessary for the development of the theory
of probability is not backed by any attempt at proof. Instead, he considers 
definitions of probability alternative to his own and attempts to show by 
example that, if these definitions are adhered to, the results of their appli- 
cation would be totally unreasonable and unacceptable to anyone. Some 
of the examples are striking. On page 300, Jeffreys refers to an article of 
mine⁷ in which probability is defined exactly as it is in the present volume.

Jeffreys writes: 


The first definition is sometimes called the “classical” one, and is stated in much 
modern work, notably that of J. Neyman. 


3 Harold Jeffreys: Theory of probability. Clarendon Press, Oxford, 1939, vi + 380 pp.

4 Jeffreys, op. cit., p. 300. 

5 Jeffreys, op. cit., p. 303. 

6 “Given p, q is either more or less probable than r, or both are equally probable;
and no two of these alternatives can be true.” Jeffreys, op. cit., p. 16. 

7 J. Neyman: “Outline of a theory of statistical estimation based on the classical
theory of probability.” Phil. Trans. Roy. Soc. London, Ser. A, Vol. 236 (1937), pp. 
333-380. 
However, Jeffreys does not quote the definition that I use but chooses
to reword it as follows: 


If there are n possible alternatives, for m of which p is true, then the probability of
p is defined to be m/n. 


He goes on to say: 


The first definition appears at the beginning of De Moivre’s book (Doctrine of 
Chances, 1738). It often gives a definite value to a probability; the trouble is that the 
value is one that its user immediately rejects. Thus suppose that we are considering 
two boxes, one containing one white and one black ball, and the other one white and 
two black. A box is to be selected at random and then a ball at random from that box. 
What is the probability that the ball will be white? There are five balls, two of which 
are white. Therefore, according to the definition, the probability is 2/5. But most 
statistical writers, including, I think, most of those that professedly accept the definition, 
would give (1/2)·(1/2) + (1/2)·(1/3) = 5/12. This follows at once on the present theory,
the terms representing two applications of the product rule to give the probability of 
drawing each of the two white balls. These are then added by the addition rule. But 
the proposition cannot be expressed as the disjunction of five alternatives out of twelve. 
My attention was called to this point by Miss J. Hosiasson. 


The solution, 2/5, suggested by Jeffreys as the result of an allegedly 
strict application of my definition of probability is obviously wrong. The 
mistake seems to be due to Jeffreys’ apparently harmless rewording of the 
definition. If we adhere to the original wording and, in particular, to the 
phrase “probability of an object A having the property B,” then, prior to
attempting a solution, we would probably ask ourselves the questions:
“What are the ‘objects A’ in this particular case?” and “What is the
‘property B,’ the probability of which it is desired to compute?” Once
these questions have been asked, the answer to them usually follows and 
determines the solution. 

In the particular example of Dr. Jeffreys, the objects A are obviously 
not balls, but pairs of random selections, the first of a box and the second 
of a ball. If we like to state the problem without dangerous abbreviations, 
the probability sought is that of a pair of selections ending with a white 
ball. All the conditions of there being two boxes, the first with two balls 
only and the second with three, etc., must be interpreted as picturesque 
descriptions of the F.P.S. of pairs of selections. The elements of this set 
fall into four categories, conveniently described by pairs of symbols (1, w), 
(1, b), (2, w), (2, b), so that, for example, (2, w) stands for a pair of 
selections in which the second box was selected in the first instance, and 
then this was followed by the selection of the white ball. Denote by 
$n_{1,w}$, $n_{1,b}$, $n_{2,w}$ and $n_{2,b}$ the (unknown) numbers of the elements of the F.P.S. 
belonging to each of the above categories, and by n their sum. Then the 
probability sought is 

$$P\{w \mid \text{pair of selections}\} = \frac{n_{1,w} + n_{2,w}}{n}. \tag{9}$$

The conditions of the problem imply 

$$P\{1 \mid \text{pair of selections}\} = \frac{n_{1,w} + n_{1,b}}{n} = \frac{1}{2}, \tag{10}$$

$$P\{2 \mid \text{pair of selections}\} = \frac{n_{2,w} + n_{2,b}}{n} = \frac{1}{2}, \tag{11}$$

$$P\{w \mid \text{pair of selections beginning with box No. 1}\} = \frac{n_{1,w}}{n_{1,w} + n_{1,b}} = \frac{1}{2}, \tag{12}$$

$$P\{w \mid \text{pair of selections beginning with box No. 2}\} = \frac{n_{2,w}}{n_{2,w} + n_{2,b}} = \frac{1}{3}. \tag{13}$$

It follows 

$$n_{1,w} = \tfrac{1}{2}(n_{1,w} + n_{1,b}) = \tfrac{1}{4}n, \tag{14}$$

$$n_{2,w} = \tfrac{1}{3}(n_{2,w} + n_{2,b}) = \tfrac{1}{6}n, \tag{15}$$

$$P\{w \mid \text{pair of selections}\} = \tfrac{5}{12}. \tag{16}$$


The method of computing probability used here is a direct enumeration 
of elements of the F.P.S. For this reason it is called the “direct method.” 
As we can see from this particular example, the direct method is occasion- 
ally cumbersome and the correct solution is more easily reached through 
the application of certain theorems basic in the theory of probability. These 
theorems, the addition theorem and the multiplication theorem, are very 
easy to apply, with the result that students frequently manage to learn the 
machinery of application without understanding the theorems. To check 
whether or not a student does understand the theorems, it is advisable to 
ask him to solve problems by the direct method. If he cannot, then he 
does not understand what he is doing. 
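
The check described above can be sketched in a few lines of Python. The sketch assumes a hypothetical F.P.S. of twelve pairs of selections, with category counts chosen only so that the conditions (10) to (13) hold; the names `counts` and `prob` are illustrative and not part of the lecture. It computes the probability once by direct enumeration and once by the addition and multiplication theorems.

```python
from fractions import Fraction

# Hypothetical F.P.S. of n = 12 pairs of selections. The category counts are an
# illustrative assumption, chosen so that conditions (10)-(13) are satisfied.
counts = {(1, "w"): 3, (1, "b"): 3, (2, "w"): 2, (2, "b"): 4}


def prob(has_property, belongs=lambda element: True):
    """Ratio of the number of elements with the property to the number considered."""
    numerator = sum(c for e, c in counts.items() if belongs(e) and has_property(e))
    denominator = sum(c for e, c in counts.items() if belongs(e))
    return Fraction(numerator, denominator)


# Direct method: enumerate the elements ending with a white ball.
direct = prob(lambda e: e[1] == "w")

# Addition and multiplication theorems applied to the same set.
p_box1 = prob(lambda e: e[0] == 1)
p_box2 = prob(lambda e: e[0] == 2)
p_w_given_box1 = prob(lambda e: e[1] == "w", lambda e: e[0] == 1)
p_w_given_box2 = prob(lambda e: e[1] == "w", lambda e: e[0] == 2)
by_theorems = p_box1 * p_w_given_box1 + p_box2 * p_w_given_box2

print(direct, by_theorems)
```

Both routes give 5/12 for this set; any other counts in the proportions 3 : 3 : 2 : 4 would serve equally well.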

Checks of this kind were part of the regular program of instruction in 
Warsaw where Miss Hosiasson was one of my assistants. Miss Hosiasson 
was a very talented lady who has written several interesting contributions 
to the theory of probability. One of these papers⁸ deals specifically with 
various misunderstandings which, under the high sounding name of paradoxes, 
still litter the scientific books and journals. Most of these paradoxes 
originate from lack of precision in stating the conditions of the 
problems studied. In these circumstances, it is most unlikely that Miss 
Hosiasson could fail in the application of the direct method to a simple 
problem like the one described by Dr. Jeffreys. On the other hand, I can 
well imagine Miss Hosiasson making a somewhat mischievous joke. 

8 Janina Hosiasson: “Quelques remarques sur la dépendance des probabilités a posteriori 
de celles a priori.” C.R., Premier Congrès des Math. des Pays Slaves, Warszawa, 
1929, pp. 375-382. 

Some of the paradoxes solved by Miss Hosiasson are quite amusing. 
The facility with which one is able to resolve these paradoxes may serve 
as a test as to whether or not the definition of probability is properly 
understood. The following paradox is taken from the “Treatise on Prob- 
ability” by J. M. Keynes (London, 1921, p. 378). Like Dr. Jeffreys, Lord 
Keynes was also a proponent of the subjective theory of probability. 

Consider an urn U of which it is known that it contains exactly n balls. 
About the color of the balls no information is available. Denote by m the 
number of black balls in the urn. Because of the complete lack of information 
as to the color of the balls and since there are n + 1 possible 
hypotheses about the value of m, namely m = 0, 1, 2, ..., n, the subjective 
theory of probability ascribes to each of these hypotheses the same probability, 
namely 1/(n + 1). Granting this, it is easy to show that the 
probability, say P{B}, that a ball drawn from the urn will be black is 
P{B} = 1/2. This conclusion, by itself, is not questioned. However, Lord 
Keynes seems to have been puzzled by the circumstance that what applies 
to black balls should equally apply to white balls and yellow balls. 
Therefore, if we denote by P{W} and P{Y} the probabilities that the 
ball drawn will be white and that it will be yellow, respectively, then 
P{W} = P{Y} = P{B} = 1/2. 

Further, since the colors white, yellow and black are exclusive, the probability 
that the ball drawn will be either black, white or yellow would 
appear to have the absurd value P{B + W + Y} = 1.5. How come? The 
reader may wish to try to resolve this “paradox” on his own. If he does 
not succeed, then he may find it interesting to consult the paper of Miss 
Hosiasson. 
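
The step “it is easy to show” can be checked numerically. The following Python sketch grants the premise that each hypothesis m = 0, 1, ..., n has probability 1/(n + 1) and sums the products of these probabilities with the chance m/n of drawing a black ball under each hypothesis. The function name `prob_black` is an illustrative assumption. The sketch verifies only the value of P{B}; it does not attempt to resolve the paradox.

```python
from fractions import Fraction


def prob_black(n):
    """P{B} when each hypothesis m = 0, 1, ..., n is given probability 1/(n + 1)."""
    prior = Fraction(1, n + 1)
    # Under the hypothesis of m black balls among n, a black ball is drawn
    # with probability m/n.
    return sum(prior * Fraction(m, n) for m in range(n + 1))


values = {n: prob_black(n) for n in (1, 2, 5, 10, 100)}
print(values)
```

The value does not depend on n, since the sum of m/n over m = 0, ..., n equals (n + 1)/2.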

2. MORE GENERAL DEFINITION OF PROBABILITY. The foregoing definitions 
and examples are perhaps sufficient to explain the basic ideas underlying 
the theory of probability when the fundamental probability set is finite. 
Let us now turn to the more general case and assume that the F.P.S., say 
(A), is anything, finite or infinite. As formerly, let us denote by (B) the 
set of elements of (A) that have some distinctive property B. 

The definition of probability I am going to give will apply only to cer- 
tain sets (A) and to certain properties B, not to all possible ones. In fact, 
we shall require that the following postulates be satisfied by the class of 
subsets (B) of (A) which correspond to the properties B for which the probability 
will be defined. This class will be denoted by ((B)). 
It will be assumed 


(1) that the class ((B)) includes (A) so that (A) is an element of ((B)). 
(2) that for the class ((B)) it is possible to define a single-valued function 
m(B), called the measure of (B), wherefore the sets (B) belonging 
to the class ((B)) will be called measurable. The assumed properties 
of the measure are as follows: 
(a) Whatever be (B) of the class ((B)), $m(B) \ge 0$. 
(b) If (B) is empty (does not contain any element at all), then it is 
measurable and m(B) = 0. 
(c) The measure of (A) is greater than zero. 
(d) If $(B_1), (B_2), \ldots, (B_n), \ldots$ is any at most denumerable set of 
measurable subsets, then their sum, $(\Sigma B_i)$, is also measurable. 
If no two subsets $(B_i)$ and $(B_j)$ (where $i \ne j$) have common 
elements, then $m(\Sigma B_i) = \Sigma\, m(B_i)$. 
(e) If (B) is measurable, then the set $(\bar{B})$ of objects A not possessing 
the property B is also measurable and consequently, owing to 
(d), $m(B) + m(\bar{B}) = m(A)$. 


Under the above conditions the probability, P{B | A}, of an object A 
having the property B will be defined as the ratio 

$$P\{B \mid A\} = \frac{m(B)}{m(A)}.$$

The probability P{B | A}, or P{B} for short, may be called the absolute 
probability of the property B. Denote by $B_1B_2$ the property of A 
consisting in the presence of both $B_1$ and $B_2$. It is easy to show that 
if $(B_1)$ and $(B_2)$ are both measurable, then $(B_1B_2)$ will be measurable 
also. If $m(B_2) > 0$ then the ratio, say 

$$P\{B_1 \mid B_2\} = \frac{m(B_1B_2)}{m(B_2)},$$

will be called the relative probability of $B_1$ given $B_2$. This definition of 
the relative probability applies when the measure $m(B_2)$ as defined for 
the fundamental probability set (A) is not equal to zero. If, however, 
$m(B_2) = 0$, but we are able to define some other measure, say $m'$, applicable 
to $(B_2)$ and to a class of its subsets including $(B_1B_2)$ such that $m'(B_2) > 0$, 
then the relative probability of $B_1$ given $B_2$ will be defined as 

$$P\{B_1 \mid B_2\} = \frac{m'(B_1B_2)}{m'(B_2)}.$$

Whatever may be the case we shall have 

$$P\{B_1B_2\} = P\{B_1\}P\{B_2 \mid B_1\} = P\{B_2\}P\{B_1 \mid B_2\}. \tag{17}$$


It is easy to see that if the fundamental probability set is finite, then 
the number of elements in any of its subsets will satisfy the definition of 
measure. On the other hand, if (A) is the set of points filling up a certain 
region in n-dimensional space, then the measures of Borel and of Lebesgue 
will satisfy the definition used here. 
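
For a finite set the postulates can be checked mechanically. The Python sketch below takes as a hypothetical F.P.S. the 36 ordered results of two dice (an illustrative assumption, as are the names `m` and `P`), uses the number of elements as the measure, and verifies additivity, the complement rule and formula (17) for two particular properties.

```python
from fractions import Fraction

# Hypothetical finite F.P.S.: the 36 ordered results of throwing two dice.
A = frozenset((i, j) for i in range(1, 7) for j in range(1, 7))


def m(subset):
    """Counting measure: the number of elements in a subset of (A)."""
    return len(subset)


def P(B, given=A):
    """P{B | given} = m(B and given) / m(given); requires m(given) > 0."""
    return Fraction(m(B & given), m(given))


B1 = frozenset(a for a in A if a[0] + a[1] == 8)  # property: the sum is 8
B2 = frozenset(a for a in A if a[0] % 2 == 0)     # property: the first die is even

# Postulates (b) and (c): the empty set has measure zero, (A) has positive measure.
assert m(frozenset()) == 0 and m(A) > 0
# Postulate (d) for two disjoint subsets.
assert m(B1 | (B2 - B1)) == m(B1) + m(B2 - B1)
# Postulate (e): a set and its complement in (A).
assert m(B1) + m(A - B1) == m(A)

# Formula (17).
joint = P(B1 & B2)
assert joint == P(B1) * P(B2, B1) == P(B2) * P(B1, B2)
print(joint)
```

Here the relative probabilities are computed with the same counting measure, which is legitimate because both m(B1) and m(B2) are positive.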

If the objects A are not points (e.g., if they are certain lines, etc.), the 
above definition of probability can still be applied, provided it is possible 
to define a measure over a class of subsets of (A). One way of achieving 
this, which is frequently applicable, is to establish a one-to-one correspond- 
ence between the objects of (A) and some other objects (A’) for which a 
measure has already been defined. If (B’) is any measurable subset of 
(A’) and (B) the corresponding subset of (A), then the measure of (B) 
can be defined to be equal to that of (B’). 

If a one-to-one correspondence between (A) and (A’) can be established 
at all, then it usually will be easy to establish it in more than one way and 
each definition of correspondence between objects A and objects A’ will 
imply, or as one occasionally says, induce a new definition of measure for 
subsets of (A). This, for instance, is the case when the objects A are chords 
in a circle C of radius r and objects A’ points in a plane. It may be useful 
to consider two of the possible ways of establishing a one-to-one corre- 
spondence between the chords and the points leading to two different defini- 
tions of measure of the subsets of chords. Specifically, we will discuss the 
so-called Bertrand’s problem which consists in determining the probability 
that a chord drawn “at random” in the circle C will have its length 2h 
greater than some specified value 2k < 2r. 

(i) Denote by x the angle between a fixed direction and the radius per- 
pendicular to any given chord A, in a circle of radius r. Further, let y 
be the perpendicular distance of the chord A from the centre of the circle C. 
Now let A’ denote a point on the xy plane with coordinates x and y; then 
there will be a one-to-one correspondence between the chords (A) of length 
$0 < 2h \le 2r$ and the points of a rectangle, say (A’), defined by two pairs 
of conditions $[(0 \le x < \pi)\ (0 \le y \le r)]$ and $[(\pi \le x < 2\pi)\ (0 < y \le r)]$. The 
class of measurable subsets of chords may now be defined to be composed 
of all such subsets which correspond to subsets of (A’) that are 
measurable in the sense of Borel. This includes the subset (AB) of chords 
with lengths 2h > 2k. In fact, these chords correspond to points, say A’B’, 
in (A’) with their coordinate $y < +\sqrt{r^2 - k^2}$. The set of points (A’B’) 
fills in a rectangle (apart from some points on the boundary) and its Borel 
measure is equal to the area of this rectangle, namely $2\pi\sqrt{r^2 - k^2}$. Since 
the measure of the entire set (A’) is $2\pi r$, it follows that the probability 
in which we are interested is $P\{h > k\} = \sqrt{1 - (k/r)^2}$. 

Figure 3 
Solution 1. Here the set (A) of chords is mapped on the rectangle (A’), the correspondence 
between chords and points in (A’) being a one-to-one correspondence. 

(ii) Denote by x and y the angles between a fixed direction and the radii 
pointing towards the two ends of a given chord A. If A″ denotes a point 
on a plane with coordinates x and y, then there exists a one-to-one correspondence 
between the chords of the set (A) and the points within the 
parallelogram (A″) (see Figure 4) determined by the two pairs of conditions 
$[(0 \le x < \pi)\ (x \le y \le x + \pi)]$ and $[(\pi \le x < 2\pi)\ (x \le y < x + \pi)]$. 
If $(A_1'')$ is a subset of (A″) which is measurable in the sense of Borel and 
if $(A_1)$ is the corresponding subset of chords, then define $(A_1)$ as measurable 
and let the measure $m(A_1)$ be equal to the Borel measure of $(A_1'')$. 
The points in (A″) which correspond to chords with lengths exceeding 2k 
lie above the dotted line $y = x + 2 \arcsin (k/r)$. Since these points fill in 
a parallelogram, the set is measurable and its Borel measure coincides with 
the area of the parallelogram, namely $2\pi(\pi - 2 \arcsin (k/r))$. Since the 
measure of the entire set (A) is equal to that of the entire set (A″) which 
is $2\pi^2$, it follows that the probability $P\{h > k\} = 1 - (2/\pi) \arcsin (k/r)$. 

It is seen that the two solutions differ and it may be asked which of 
them is correct. The answer is that both are correct, but that they corre- 
spond to different conditions of the problem. In fact the question “what 
is the probability of a chord having its length greater than 2k” does not 
specify the problem entirely. This problem is only determined when we 
define the measure appropriate to the set (A) and the subsets of (A) to 
be considered. We may describe this differently, using the terms “random 
experiments” and “their results.” We may say that to have a problem 
of probability determined, it is necessary to define the method by which 
the randomness of an experiment is attained. Describing the conditions of 
the problem concerning the length of a chord that lead to the first solution 
(Figure 3), we could say that in selecting at random a chord A, we first 
pick at random the direction of a radius, all directions being “equally 
probable,” and then, also at random, we select the distance between the 
centre of the circle and the chord, all values between zero and r being 
“equally probable.” It is easy to see what would be the description in the 
same language of the random experiment leading to the second solution 
(Figure 4). 

Figure 4 
Solution 2. Here the set (A) of chords is mapped on the parallelogram (A″), the correspondence 
between chords and points in (A″) being a one-to-one correspondence. 
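
The two measures can be compared by simulation. The Python sketch below is an illustration under stated assumptions: a pseudo-random generator stands in for the “random experiment,” points are drawn uniformly in the rectangle of solution (i) and in the parallelogram of solution (ii), and all function names are hypothetical. The observed frequencies of chords with h > k are then set beside the two closed-form values derived above.

```python
import math
import random


def exact_solution_1(k, r):
    """P{h > k} when the measure is the area in the rectangle of solution (i)."""
    return math.sqrt(1.0 - (k / r) ** 2)


def exact_solution_2(k, r):
    """P{h > k} when the measure is the area in the parallelogram of solution (ii)."""
    return 1.0 - (2.0 / math.pi) * math.asin(k / r)


def simulate_solution_1(k, r, trials, rng):
    """Uniform point in the rectangle; the direction x does not affect the length."""
    hits = 0
    for _ in range(trials):
        y = rng.uniform(0.0, r)           # distance of the chord from the centre
        h = math.sqrt(r * r - y * y)      # half-length of the chord
        hits += h > k
    return hits / trials


def simulate_solution_2(k, r, trials, rng):
    """Uniform point in the parallelogram: x in [0, 2*pi), y in [x, x + pi]."""
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 2.0 * math.pi)
        y = x + rng.uniform(0.0, math.pi)
        h = r * math.sin((y - x) / 2.0)   # half-length from the two end-point angles
        hits += h > k
    return hits / trials


rng = random.Random(0)   # fixed seed so that the run is repeatable
r, k = 1.0, 0.5
est1 = simulate_solution_1(k, r, 200_000, rng)
est2 = simulate_solution_2(k, r, 200_000, rng)
print(est1, exact_solution_1(k, r))
print(est2, exact_solution_2(k, r))
```

The two estimates settle near different values, which is the point of the example: the same verbal question corresponds to two different measures.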

We frequently use this way of speaking, but it is necessary to remember 
that behind such words, as e.g., “picking at random a direction, all of them 
being equally probable,” there is a definition of the measure appropriate 
to the fundamental probability set and its subsets. I want to emphasize 
that in all my writings a phrase like the previous one in quotation marks is 
no more than a way of describing the fundamental probability set and its 
appropriate measure. The concept “equally probable” is not in any way 
involved in the definition of probability adopted and it is a pure convention 
that the statement 

“In picking a chord at random, we first select a direction, all directions 
being equally probable; and then we choose a distance between the centre 
of the circle and the chord, all values between zero and r being equally 
probable.” 

means no more and no less than 

“For the purpose of calculating the probabilities concerning chords in a 
circle, the measure of any set (A) of chords is defined as that of the set 
(A’) of points, each with coordinates x and y and such that for any chord 
A in (A), x is the direction of the radius perpendicular to A and y the 
distance of A from the centre of the circle. (A) is measurable only if 
(A’) is so.” 


However free we are in mathematical work to use words that we find 
convenient as long as they are clearly defined, our choice must be justified 
in one way or another. The justification for speaking of the definition of 
measure within the fundamental probability set in terms of imaginary 
random experiments lies in the empirical fact which Bortkiewicz⁹ insisted 
upon calling the “law of large numbers.” This law says that, given a 
purely mathematical definition of a probability set including the appro- 
priate measure, we are able to construct a real experiment, possible to carry 
out in any laboratory, with a certain range of possible results and such 
that if it is repeated many times, the relative frequencies of these results 
and their different combinations in small series approach closely the values 
of probabilities as calculated from the definition of the fundamental prob- 
ability set. Examples of such real random experiments are provided by 
the experience of roulette,¹⁰ by the experiment of throwing a needle¹¹ so 
as to obtain an analogy to the problem of Buffon, and by various sampling 
experiments based on Tippett’s random numbers.¹² 

9 L. von Bortkiewicz: Die Iterationen. Julius Springer, Berlin, 1917, x + 205 pp. 

10 Bortkiewicz, loc. cit. 

11 This is mentioned by E. Borel, Eléments de la Théorie des Probabilités, Hermann, 
Paris, 1909, vii + 205 pp. Cf. p. 106. 

12 L. H. C. Tippett: “Random sampling numbers.” Tracts for Computers, No. XV, 
Cambridge University Press, 1927, viii + 26 pp. 

These examples show that random experiments corresponding in the 
sense described to mathematically defined probability sets are possible. 
However, frequently they are technically difficult. E.g., if we take any 
coin and toss it many times, it is very probable that the frequency of heads 
will not approach 1/2. To get this result we must select what could be 
called a well-balanced coin and we must work out an appropriate method 
of tossing. Whenever we succeed in arranging the technique of a random 
experiment, such that the relative frequencies of its different results in 
long series approach, sufficiently in our opinion, the probabilities calculated 
from a fundamental probability set (A), we shall say that the set adequately 
represents the method of carrying out the experiment. 
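
The notion of a set “adequately representing” an experiment can be imitated on a computer, with the caveat that a pseudo-random generator is only a stand-in for the real experiments discussed here. In the Python sketch below, the function `relative_frequency` and the two coins (one balanced, one with an assumed chance 0.55 of heads) are illustrative assumptions. The relative frequency of heads is compared with the value 1/2 calculated from a two-element F.P.S.

```python
import random


def relative_frequency(p_heads, tosses, rng):
    """Relative frequency of heads in a simulated series of tosses."""
    heads = sum(rng.random() < p_heads for _ in range(tosses))
    return heads / tosses


rng = random.Random(1)   # fixed seed so that the run is repeatable

# Longer and longer series with a simulated well-balanced coin.
for tosses in (100, 10_000, 100_000):
    print(tosses, relative_frequency(0.5, tosses, rng))

freq_balanced = relative_frequency(0.5, 100_000, rng)
freq_biased = relative_frequency(0.55, 100_000, rng)   # an ill-balanced coin
print(freq_balanced, freq_biased)
```

For the ill-balanced coin the frequency stays away from 1/2 however long the series, so the two-element set with equal measures would not adequately represent that experiment.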

We shall now draw a few obvious but important conclusions from the 
definition of probability which we have adopted. 

(1) If the fundamental probability set consists of only one element, any 
probability calculated with regard to this set must have the value either 
zero or unity. 

(2) If all elements of the fundamental probability set (A) possess a 
certain property $B_0$, then the absolute probability of $B_0$, and also its relative 
probability, given any other property $B_1$, must be equal to unity, so that 
$P\{B_0 \mid A\} = P\{B_0\} = P\{B_0 \mid B_1\} = 1$. On the other hand, if it is known 
only that $P\{B_0\} = 1$, then it does not necessarily follow that $P\{B_0 \mid B_1\}$ 
must be equal to unity. 

3. RANDOM VARIABLES. We may now proceed to the definition of a ran- 
dom variable. We shall say that x is a random variable if it is a single- 
valued measurable function (not a constant) defined within the funda- 
mental probability set (A) with the exception perhaps of a set of elements 
of measure zero. We shall consider only cases where x is a real numerical 
function. If x is a random variable, then its value corresponding to any 
given element A of (A) may be considered as a property of A, and whatever 
the real numbers a < b, the definition of (A) will allow the calculation 
of the probability, say $P\{a \le x < b\}$, of x having a value such that 
$a \le x < b$. 

We notice also that as x is not constant in (A), it is possible to find at 
least one pair of elements, $A_1$ and $A_2$ of (A), such that the corresponding 
values of x, say $x_1 < x_2$, are different. If we denote by B the property 
distinguishing both $A_1$ and $A_2$ from all other elements of (A), and if a < b 
are two numbers such that $a < x_1 < b < x_2$, then $P\{a \le x < b \mid B\} = 1/2$. 
It follows that if x is a random variable in the sense of the above definition, 
then there must exist such properties B and such numbers a < b 
that $0 < P\{a \le x < b \mid B\} < 1$. 

It is obvious that the above two properties are equivalent to the definition 
of a random variable. In fact, if x has the properties (a) that whatever 
a < b the definition of the fundamental probability set (A) allows the 
calculation of the probability $P\{a \le x < b\}$, and (b) that there are such 
properties B and such numbers a < b that $0 < P\{a \le x < b \mid B\} < 1$, then 
x is a random variable in the sense of the above definition. 

The probability $P\{a \le x < b\}$ considered as a function of a and b will 
be called the integral probability law of x. 
A random variable is contrasted with a constant, say θ, the numerical 
values of which, corresponding to all elements of the set (A), are 
all equal. If θ is a constant, then whatever a < b and B, the probability 
$P\{a \le \theta < b \mid B\}$ may have only values unity or zero according to whether 
θ falls in between a and b or not. 

If we keep in mind the above definitions of the variables in our discus- 
sions of them, we may speak in terms of random experiments. In the sense 
of the convention adopted previously, we may say that x is a random vari- 
able when its values are determined by the results of a random experiment. 
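
The contrast between a random variable and a constant can be made concrete on a small finite set. The Python sketch below assumes, purely for illustration, that the F.P.S. consists of the 36 ordered results of two dice with the counting measure; `x` is the sum of the two numbers, `theta` is a function taking the same value on every element, and `integral_law` is a hypothetical helper computing P{a ≤ f < b | B}.

```python
from fractions import Fraction

# Hypothetical F.P.S.: the 36 ordered results of two dice, counting measure.
A = [(i, j) for i in range(1, 7) for j in range(1, 7)]


def x(element):
    """A random variable: the sum of the two numbers thrown."""
    return element[0] + element[1]


def theta(element):
    """A constant: the same value for every element of (A)."""
    return 7


def integral_law(f, a, b, B=lambda element: True):
    """P{a <= f < b | B} computed as a ratio of counts."""
    restricted = [e for e in A if B(e)]
    favourable = sum(a <= f(e) < b for e in restricted)
    return Fraction(favourable, len(restricted))


p_x = integral_law(x, 2, 5)                                   # sums 2, 3 or 4
p_x_given = integral_law(x, 2, 5, B=lambda e: e[0] == 1)      # first die shows 1
p_theta_inside = integral_law(theta, 5, 8)                    # 7 lies in [5, 8)
p_theta_outside = integral_law(theta, 8, 10, B=lambda e: e[0] == 1)
print(p_x, p_x_given, p_theta_inside, p_theta_outside)
```

For `x` there are properties B and numbers a < b giving a value strictly between zero and unity, while for `theta` only the values zero and unity occur, whatever B is chosen.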

It is important to keep a clear distinction between random variables 
and unknown constants. The 1000th decimal, $X_{1000}$, in the expansion of 
$\pi = 3.14159\ldots$ is a quantity unknown to me, but it is not a random 
variable since its value is perfectly fixed, whatever fundamental probability 
set we choose to consider. We could say alternatively that the value of 
$X_{1000}$ does not depend upon the result of any random experiment. 

Frequently we have to consider simultaneously several random variables 


$$x_1, x_2, \ldots, x_n \tag{18}$$


and their simultaneous integral probability law, to be defined as follows. 

Denote by E the set of values of the n variables (18). This set could 
be represented by a point which will be called the sample point E in an 
n-dimensional space, say W, the rectangular coordinates of the point E 
being the values $x_1, x_2, \ldots, x_n$. The space W will be called the sample 
space. Denote by w any region in W and accept the convention that $E \in w$ 
stands for the words: “the point E is an element of w.” 

If the $x_i$ are random variables, then whatever be w, we may speak of 
the probability of E being an element of w, and denote it by $P\{E \in w\}$. 
In fact this probability will be represented by the ratio of the measure of 
that part, say F(w), of the F.P.S. in which the $x_i$ have values locating the 
point E within the boundaries of w to the measure of the F.P.S. itself. Of 
course, it must be assumed that w is measurable. With that restriction 
the probability, $P\{E \in w\}$, is defined for every region w. This probability, 
considered as a function of the region w, is called the simultaneous integral 
probability law of the $x_i$. 

Apart from, or instead of, the integral probability law we may frequently 
consider another function called the elementary probability law of the ran- 
dom variables. This is defined as follows. 

If $P\{E \in w\}$ stands for the integral probability law of the variables (18), 
and if there exists a function p(E) of the $x_i$ such that whatever be w, for 
which the probability $P\{E \in w\}$ exists, 

$$P\{E \in w\} = \int \cdots \int_w p(E)\, dx_1\, dx_2 \cdots dx_n, \tag{19}$$

then the function p(E) is called the elementary probability law of the 
random variables (18). 

Remark: The terms “integral probability law” and “elementary probability 
law” were introduced in the 1920’s by the noted French probabilist, 
Paul Lévy. In more recent times they are being partially replaced by 
“distribution function” and “probability density function,” respectively. 

It will be noticed that while the integral probability law is a function 
of the region w, the elementary probability law is a function only of the 
point E. It will be noticed also that p(E) may be considered as being 
defined in the whole sample space and non-negative. Of course there are 
cases where no elementary probability law in the above sense exists; this, 
however, happens rarely in problems of statistics. 

It is important to know a few simple rules of dealing with elementary 
probability laws. 

(i) If $p(x_1, x_2, \ldots, x_n)$ and $p(x_1, x_2, \ldots, x_{n-1})$ are the elementary probability 
laws of 

$$x_1, x_2, \ldots, x_{n-1}, x_n \quad \text{and} \quad x_1, x_2, \ldots, x_{n-1} \tag{20}$$

respectively, then 

$$p(x_1, x_2, \ldots, x_{n-1}) = \int_{-\infty}^{+\infty} p(x_1, x_2, \ldots, x_{n-1}, x_n)\, dx_n. \tag{21}$$

This rule permits the calculation of the elementary probability law of any 
single one of the $x_i$ whenever their simultaneous probability law is known. 
(ii) If there are two sets of n random variables each, 

$$x_1, x_2, \ldots, x_n \tag{22}$$

and 

$$y_1, y_2, \ldots, y_n \tag{23}$$

such that each of the $x_i$ is a function of the $y_j$, possessing continuous partial 
derivatives with regard to any $y_j$, the Jacobian 

$$\Delta = \frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} \tag{24}$$

existing and being different from zero almost everywhere and never changing 
its sign, then the probability laws $p(x_1, \ldots, x_n)$ and $p(y_1, \ldots, y_n)$ of 
the variables (22) and (23) respectively, are connected by the identity 

$$p(y_1, y_2, \ldots, y_n) = p(x_1, x_2, \ldots, x_n)\,|\Delta| \tag{25}$$

where in the right-hand side the $x_i$ will ordinarily be expressed in terms of 
the $y_j$. 
Combining the two above rules we may calculate the probability law of 
various functions, f(E), of the $x_i$ whenever the simultaneous probability 
law of the latter is known. 
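
As a sketch of how the two rules combine, take n = 2 and suppose, as an illustrative assumption, that the elementary probability law of x1 and x2 equals unity on the unit square and zero elsewhere. To find the law of f(E) = x1 + x2, put y1 = x1 + x2 and y2 = x2, so that x1 = y1 − y2, x2 = y2 and the Jacobian of the x's with regard to the y's equals 1. Rule (ii) then gives the law of (y1, y2), and rule (i) integrates y2 out. The Python code below performs the integration by a simple midpoint rule; all names in it are hypothetical.

```python
def p_x(x1, x2):
    """Assumed elementary law of (x1, x2): unity on the unit square, zero elsewhere."""
    return 1.0 if 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0 else 0.0


# For x1 = y1 - y2, x2 = y2 the Jacobian is det [[1, -1], [0, 1]] = 1.
JACOBIAN = 1.0


def p_y(y1, y2):
    """Rule (ii): law of (y1, y2), with the x's expressed through the y's."""
    return p_x(y1 - y2, y2) * abs(JACOBIAN)


def p_y1(y1, steps=20_000):
    """Rule (i): integrate y2 out; p_y vanishes outside 0 <= y2 <= 1."""
    width = 1.0 / steps
    return sum(p_y(y1, (j + 0.5) * width) for j in range(steps)) * width


def triangular(y1):
    """Known closed form for the law of the sum, used here only as a check."""
    if 0.0 <= y1 <= 1.0:
        return y1
    if 1.0 < y1 <= 2.0:
        return 2.0 - y1
    return 0.0


for value in (0.25, 1.0, 1.5):
    print(value, p_y1(value), triangular(value))
```

The numerical values agree with the triangular law up to the discretisation error of the midpoint rule.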

In order to clear the way for the material involved in the following lec- 
tures, I shall finish this one by giving definitions relating to statistical 
hypotheses. 

Consider the set of random variables $x_1, x_2, \ldots, x_n$. Any assumption 
concerning their probability law (either integral or elementary) is called 
a statistical hypothesis. 

A statistical hypothesis is called simple if it specifies the integral probability 
law, $P\{E \in w\}$, of the $x_i$ as a single-valued function of the region w. 

Any statistical hypothesis that is not simple is called composite. It may 
be useful to illustrate these definitions by some examples. 

The assumption $H_1$ that¹³ 

$$p(E) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\sum (x_i - \mu)^2 / 2\sigma^2}, \tag{26}$$

where neither μ nor σ > 0 is specified, is a composite statistical hypothesis. 
In fact, if w denotes a region defined by the inequality 

$$\sum x_i^2 \le 1,$$

then 

$$P\{E \in w\} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \int \cdots \int_w e^{-\sum (x_i - \mu)^2 / 2\sigma^2}\, dx_1\, dx_2 \cdots dx_n \tag{27}$$

is not uniquely determined but is a function of the parameters μ and σ, 
which are left unspecified by the hypothesis $H_1$. 

On the other hand, the assumption $H_2$ that the elementary probability 
law of the $x_i$ is as given by formula (26) but with μ = 0 and σ = 1 is 
already a simple hypothesis. In fact, whatever the region w in the sample 
space, substituting μ = 0 and σ = 1 in (27), we shall be able to calculate 
the unique numerical value of $P\{E \in w\}$, although at times this may be 
connected with great technical difficulties. 
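
The difference between the two hypotheses can be illustrated numerically. The Python sketch below assumes n = 2 and, for illustration only, takes w to be the unit disc x1² + x2² ≤ 1. With μ = 0 the integral (27) has the closed form 1 − exp(−1/(2σ²)), obtained by passing to polar coordinates; for other values of μ the sketch falls back on a simulated frequency. The function names are hypothetical.

```python
import math
import random


def prob_unit_disc_centred(sigma):
    """P{x1^2 + x2^2 <= 1} for n = 2 and mu = 0 (closed form via polar coordinates)."""
    return 1.0 - math.exp(-1.0 / (2.0 * sigma ** 2))


def prob_unit_disc_simulated(mu, sigma, trials, rng):
    """Simulated frequency of the same event for arbitrary mu and sigma."""
    hits = 0
    for _ in range(trials):
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        hits += x1 * x1 + x2 * x2 <= 1.0
    return hits / trials


rng = random.Random(2)   # fixed seed so that the run is repeatable

# Simple hypothesis (mu = 0, sigma = 1): a single numerical value.
p_simple = prob_unit_disc_centred(1.0)
p_simple_check = prob_unit_disc_simulated(0.0, 1.0, 200_000, rng)

# Composite hypothesis: the value changes with the unspecified parameters.
p_other = prob_unit_disc_simulated(1.0, 2.0, 200_000, rng)
print(p_simple, p_simple_check, p_other)
```

Under the simple hypothesis the probability is one definite number; under the composite hypothesis it remains a function of μ and σ, as the last value shows.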





Part 2. Probability and Experimentation 


1. ABSTRACT CHARACTER OF MATHEMATICAL THEORIES AND POSSIBILITIES OF 
APPLICATIONS. It is probable that many who listened to my first lecture 
were disappointed. They are engaged in applying probability to practical 
problems and such problems may be the only cause of their interest in the 
theory of probability. They may feel that they have no use for a theory 
which treats “experiments,” “results,” or in fact everything that is of the 
utmost importance to them only as picturesque descriptions of probability 
sets and measures. Theory of this kind may be good for mathematicians, 
they may say, but we want a mathematical theory dealing with actual 
experiments, not with abstract probability sets. 

13 The sign Σ, unless accompanied by other indications, will signify summation over i 
from 1 to n, i.e., i = 1, 2, ..., n. 

It may be useful to start this lecture by considering more closely whether 
or not it is possible to satisfy that part of my audience which is of the 
opinion described. One might put the question this way: Is it possible to 
produce a mathematical theory dealing with actual experiments or, more 
generally, with phenomena of actual life? 

My answer is: Probably never. That is, unless the word mathematics 
changes its present meaning. The objects in a real world, or rather our 
sensations connected with them, are always more or less vague and since 
the time of Kant it has been realized that no general statement concerning 
them is possible. The human mind grew tired of this vagueness and con- 
structed a science from which anything that is vague is excluded—this is 
mathematics. But the gain in generality must be paid for, and the price 
is the abstractness of the concepts with which mathematics deals and the 
hypothetical character of the results: if A is B and B is C, then A is also C. 

Of course, there are many mathematical theories that are successfully 
applied to practical problems. But this does not mean that these theories 
deal with real objects. If they did, they could not involve general state- 
ments and could not be considered as mathematical. Let us illustrate this 
by a few examples. Modern geometry is a mathematical science and is 
applied to practical problems. But does it deal with objects that we meet 
in actual life? Let us see. Geometry deals with such concepts as planes, 
straight lines, points, etc. Is there anything in real life that is exactly a 
plane in the sense of geometry? We say sometimes that the surface of a 
table is a plane. But if we look at the surface through a good magnifying 
glass we shall immediately see that it is certainly not a plane. If we say 
that it is, we mean that for practical purposes it could be considered a plane. 

Here we come to the essential point: when we apply mathematics to 
practical problems we never seek (and if we would, we should never suc- 
ceed) to find an identity between mathematical concepts and realities; we 
are satisfied if we find some correspondence between them, by which a 
mathematical formula can be interpreted in terms of realities and give a 
result which, for practical purposes, would in our opinion be sufficiently 
accurate. 

Consider a triangle $T_1$, formed by three points on this sheet of paper. 
Divide it by straight lines into four smaller triangles $T_2$, $T_3$, $T_4$ and $T_5$. 
If we state numerically the coordinates of all the vertices, we shall be able 
to apply known formulas and calculate the areas of all the five triangles. 
Naturally, the area of $T_1$ will be equal to the sum of the areas of the other 
four. This is geometry. But now take any instruments you desire and 
measure the sides of all the triangles as actually drawn. Using these measurements 
and again applying formulas we may be disappointed to find that 
the area of $T_1$ so calculated is not exactly equal to the sum of the areas 
of $T_2$, $T_3$, $T_4$ and $T_5$. 

It will be suggested that the discrepancy is due to errors of measurement. 
This is true so far as the expression “errors of measurements” stands for 
something broader, including the fact that the dots representing the vertices 
of the triangles are not the points we consider in mathematics. However, 
for many practical purposes the agreement between the area of $T_1$ and the 
sum of areas of $T_2$, $T_3$, $T_4$ and $T_5$ will be judged satisfactory and this is 
the decisive point in the question of whether or not the mathematical theory 
of geometry can be applied in practice. 
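
The experiment with the triangles can be imitated numerically. In the Python sketch below, the vertices of the large triangle, its division through the midpoints of the sides, and the rounding of each side to one decimal place (standing in for a ruler reading) are all illustrative assumptions. The areas are computed once from exact coordinates and once from the “measured” sides by Heron's formula.

```python
import math


def area_from_coordinates(p, q, s):
    """Area of a triangle from the exact coordinates of its vertices."""
    return abs((q[0] - p[0]) * (s[1] - p[1]) - (s[0] - p[0]) * (q[1] - p[1])) / 2.0


def area_from_measured_sides(p, q, s, decimals=1):
    """Heron's formula applied to side lengths rounded as if read off a ruler."""
    a, b, c = (round(math.dist(u, v), decimals) for u, v in ((p, q), (q, s), (s, p)))
    half = (a + b + c) / 2.0
    return math.sqrt(half * (half - a) * (half - b) * (half - c))


def midpoint(u, v):
    return ((u[0] + v[0]) / 2.0, (u[1] + v[1]) / 2.0)


# Hypothetical large triangle, divided through the midpoints of its sides.
P, Q, S = (0.0, 0.0), (7.0, 0.0), (2.0, 5.0)
m_pq, m_qs, m_sp = midpoint(P, Q), midpoint(Q, S), midpoint(S, P)
small = [(P, m_pq, m_sp), (m_pq, Q, m_qs), (m_sp, m_qs, S), (m_pq, m_qs, m_sp)]

exact_whole = area_from_coordinates(P, Q, S)
exact_sum = sum(area_from_coordinates(*t) for t in small)

measured_whole = area_from_measured_sides(P, Q, S)
measured_sum = sum(area_from_measured_sides(*t) for t in small)

print(exact_whole, exact_sum)
print(measured_whole, measured_sum)
```

The exact areas add up as geometry requires, while the areas computed from rounded sides disagree slightly; whether the disagreement matters depends on the practical purpose in view.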

A closer examination of other mathematical theories applied to practical 
problems will reveal the same features. The theory itself deals with abstract 
concepts not existing in the real world. But there are real objects that 
correspond to these abstract concepts in a certain sense, and numerical 
values of mathematical formulas more or less agree with the results of 
actual measurements. In the earlier stages of any branch of mathematically 
treated natural science we are satisfied with only a slight resemblance 
between mathematical and empirical results, but later on our requirements 
become more and more stringent. 

After this somewhat long general introduction we may turn to the main 
topic of this lecture which is whether, and if so, how the mathematical 
theory of probability can be usefully applied in natural science. 

2. RANDOM EXPERIMENTS AND THE EMPIRICAL LAW OF LARGE NUMBERS. It 
follows from what I said that the foundations of the theory of probability 
could be chosen in many ways. But however they are chosen, if their 
accuracy is on the level now customary in mathematics, the theory of 
probability will deal with abstract concepts and not with any real objects. 
Therefore, the application of such a theory will be possible only if one can 
establish a bridge or a correspondence between concepts of the theory and 
real facts. The actual applications must be preceded by numerous checks 
and rechecks of the permanency and the accuracy of such correspondence. 
If one judges this to be sufficiently accurate and finds it sufficiently perma- 
nent, then the predictions—the final aim of any science—based on the 
mathematical theory of probability, will have some prospect of success. 
Otherwise the theory may be interesting by itself, but useless from the 
point of view of application. 

What, then, is the class of facts that corresponds to concepts of the 
theory of probability as described in my first lecture? What is the meaning
of this correspondence? 

The class of such facts may be described as the results of random experi- 
ments. It is impossible to give an exact definition of experiments that are 
called random, but it would be equally impossible to give a definition of 
objects in the real world that deserve the description “plane,” “straight 
line,” etc. If we try to do so, we shall inevitably find ourselves speaking 
not of real objects but of abstract concepts. At most we can give a rough 
description of random experiments and some illustrations so as to appeal 
to the intuition. In what follows, unless otherwise stated, whenever I shall 
speak of experiments I shall mean real experiments, not hypothetical ones. 

There are experiments which, even if carried out repeatedly with the 
utmost care to keep conditions constant, yield varying results. They are 
“random.” 

(a) We may construct a special machine to toss coins. This machine 
may be very strong, driven by an electric motor so as to impart a con- 
stant initial velocity to the coin. The experiments may be carried on in a 
closed room with no noticeable air currents; the coin may be put into the 
machine always in the same way; and even then I am practically certain 
the results of the repeated experiments will vary. Perhaps frequently we 
may get heads, but from time to time the coin will fall tails. The experi- 
menter may be inclined to think that these cases arise from some “error of 
experimentation.” 

(b) Another example of this kind is provided by roulette. A well-con- 
structed roulette wheel with an electrically regulated start will yield varying 
results. 

(c) The above were types of random experiments arranged by men. But 
there are some going spontaneously. Consider a quantity of radioactive 
matter and the α particles it emits in some specified direction within a
cone of small solid angle. These particles could be recorded by the fluo- 
rescence they produce when falling on an appropriate screen. Let us 
observe this screen for several consecutive minutes, one minute’s observa- 
tion being considered as a single experiment. It will be found that how- 
ever constant be the conditions of the consecutive experiments, the results 
will vary in that the number of disintegrations recorded per minute will 
not be the same. 

(d) Another example of this kind is provided by the varying properties 
of organisms forming an F2 generation, however homogeneous be the con- 
ditions of breeding. 

These examples may make sufficiently clear what I mean by random 
experiments. Now I shall explain the sense in which their results correspond 
to concepts involved in the theory of probability. 
Let N and n be positive integers, N fairly large, say 1000 or so, and
n moderate, say 10. Let us perform a long series of Nn random experiments
of the type described, and count cases where a certain specified result E
occurred. Let it be in M cases. Dividing M by Nn we obtain the ratio

$$f = \frac{M}{Nn} \tag{1}$$

which will be called the relative frequency of the result E in the course of
Nn trials. These Nn trials will be called experiments of the first order.
Now divide the whole series of Nn first order experiments into N groups of
n trials each in the order in which the trials were carried out. Each such
group of n first order trials will now be considered as a trial of second order.
The second order trials could be classified according to the number k of
occurrences of the result E in the n first order trials of which they are
formed. Obviously k could be equal to 0, 1, 2, ..., n, in any one of the
second order trials. Let $m_k$ denote the number of trials in which E occurred
exactly k times, and

$$F_{n,k} = \frac{m_k}{N} \tag{2}$$

the relative frequency in the series of second order trials.

It is a surprising and very important empirical fact that whenever suffi- 
cient care is taken to carry out the first order experiments under as uniform 
conditions as possible, and when the number N is large, then the relative 
frequency $F_{n,k}$ appears to be very nearly equal to the familiar formula

$$\frac{n!}{(n-k)!\,k!}\, f^k (1-f)^{n-k} \tag{3}$$

In other words, the relative frequency $F_{n,k}$ relating to a series of second
order experiments is connected with the relative frequency of the first
order experiments in very nearly the same way as the probability $P_{n,k}$
relating to the second order probability set, as discussed in my first lecture, 
is connected with the probability p referring to the corresponding first order 
probability set. 
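
The grouping into first and second order trials can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trials come from a pseudo-random generator and not from real experiments, and the names `empirical_law_check`, `p_true` and `seed` are hypothetical. The sketch computes f as in (1), the frequencies of (2), and formula (3) evaluated at the observed f.

```python
import random
from math import comb

def empirical_law_check(p_true=0.3, n=10, N=1000, seed=0):
    """Simulate N*n first order trials and compare (2) with formula (3)."""
    rng = random.Random(seed)
    # First order trials: True when the result E occurs (simulated, not real).
    trials = [rng.random() < p_true for _ in range(N * n)]
    f = sum(trials) / (N * n)                 # relative frequency, formula (1)

    # Second order trials: consecutive groups of n first order trials.
    m = [0] * (n + 1)                         # m[k] = groups with exactly k occurrences
    for g in range(N):
        k = sum(trials[g * n:(g + 1) * n])
        m[k] += 1
    F = [m_k / N for m_k in m]                # relative frequencies, formula (2)

    # Formula (3), evaluated with the observed f rather than with p_true.
    B = [comb(n, k) * f**k * (1 - f)**(n - k) for k in range(n + 1)]
    return f, F, B

f, F, B = empirical_law_check()
for k, (Fk, Bk) in enumerate(zip(F, B)):
    print(f"k={k:2d}  F={Fk:.3f}  formula (3)={Bk:.3f}")
```

With N of the order of 1000 the two printed columns should agree to within a few hundredths. A generator that did not behave "randomly" would show up as a larger discrepancy.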

In order to avoid misunderstanding, let us describe the situation in 
greater detail. Suppose that the random experiment under consideration 
consists in 2N throws of the same die and that f is the relative frequency 
of cases where the upper side of the die had six points on it. The value 
of f may be close to 1/6 or not. It may, in fact, differ considerably from 
1/6, depending on the structure of the die and the exact conditions of 
throwing. But if we split the whole series of trials into consecutive pairs, 
then the proportions of pairs with 0, 1 and 2 sixes will be, approximately,

$$(1 - f)^2, \qquad 2f(1 - f), \qquad f^2. \tag{4}$$

The above fact, which has been found empirically ¹ many times, could
be described in a more general way by saying that single random experi- 
ments and the various groups of these experiments usually behave as if 
they tended to reproduce certain first order probability sets, corresponding 
to first order trials, and the appropriate second order probability set. This 
fact may be called the empirical law of large numbers. I want to empha- 
size that this law applies not only to the simple case connected with the 
binomial formula which was discussed above but also to other cases. In 
fact, this law seems to be perfectly general, in the sense in which we use 
the word general with respect to any other “general law” observed in the 
outside world. Whenever the law fails, we explain the failure by suspecting 
a “lack of randomness” in the first order trials. 

Suppose now that having repeatedly performed series of random experi- 
ments of some specified kind we have always found that they do conform 
to the empirical law of large numbers. Then, as is our custom, we expect 
them to behave similarly in the future, and we expect the calculus of prob- 
ability to permit us to make successful predictions of frequencies of results 
of future series of experiments. 

This is the way in which the abstract theory of probability described in 
my first lecture may be put into correspondence with happenings in the 
outside world and how it may be, and actually is, applied to solve problems 
of practical importance. The standing of the theory of probability is, in 
this respect, no different from any other branch of mathematics. The appli- 
cation of the theory involves the following steps. 

(i) If we wish to treat certain phenomena by means of the theory of 
probability we must find some element of these phenomena that could be 
considered as random, following the law of large numbers. This involves 
a construction of a mathematical model of the phenomena involving one 
or more probability sets. 

(ii) The mathematical model may be satisfactory or not. This must 
be checked by observation. 

(iii) If the mathematical model is found satisfactory, then it may be 
used for deductions concerning phenomena to be observed in the future. 

Let us illustrate these steps by a few examples taken from the current 
literature. 

3. ILLUSTRATIONS. Example 1.—Two bacteriologist friends of mine, Miss
J. Supinska and Dr. T. Matuszewski, were interested in learning whether 
the calculus of probability could be applied to certain problems concerning 


1 See, for example, L. von Bortkiewicz, Die Iterationen, Julius Springer, Berlin, 1917,
x + 205 pp.

the colonies of bacteria on a Petri-plate. The diagram reproduces a photograph
of a Petri-plate with colonies that are visible as dark spots.





You will notice that the plate is divided into a number of small squares. 
In order to explain the particular mathematical model that was tried in 
this instance, consider the contents v of one particular square and consider 
one particular living bacterium B contained in the liquid that was poured 
on the plate. In the mathematical model all the operations performed with 
the liquid and the plate which resulted in fixing the bacterium B in some 
point are considered as a first order experiment which may result either in 
B falling within v, or not. If there were N living bacteria in the liquid 
poured on to the plate, then there were N such first order experiments all 
relating to the same square v. They form a single second order experiment. 
Finally, if the number of squares in which the plate is divided be n, then
there will be n second order experiments, which, taken together, could be
considered as one third order experiment. Without going into further 
details of this mathematical model I shall state that it implies that the 
probability of any of the squares containing exactly k colonies must be 
approximately equal to the Poisson formula 

$$P_k = \frac{e^{-\lambda}\lambda^k}{k!} \tag{5}$$

where $\lambda$ means the average number of colonies per square. The reader will
notice that the above k satisfies the definition of a random variable the
integral probability law of which is given by

$$P\{a \le k \le b\} = \sum_{k=a}^{b} \frac{e^{-\lambda}\lambda^k}{k!}. \tag{6}$$

If this mathematical model could be assumed to correspond accurately to 
the actual experiments in the sense explained above, then it could be used 
for predicting frequencies of certain circumstances that are important in 
bacteriology. One of the questions that my colleagues had in mind was 
how frequently a single colony is produced by two or more unconnected
bacteria.

In order to answer the question whether or not the number k of colonies 
within a square could be considered as a random variable whose prob- 
ability law could be represented by formula (5), my colleagues performed 
a series of experiments summarized in Table I. 

The values of k are the numbers of colonies within the squares into 
which the whole plate was divided. m’ and m denote respectively the 
observed and the expected numbers of squares having the number k of 
colonies. The last two lines give measures of the goodness of fit, the chi- 
square and the corresponding P. It is seen that without exception the 
agreement between the observed and the theoretical frequencies obtained 
by multiplying the $P_k$ of formula (5) by the total number of squares on
the plate, is surprisingly good. As a matter of fact, the total number of 
similar experiments that have been carried out is much larger, and in not 
a single case has any serious disagreement between the distribution of 
colonies and the Poisson law been recorded. This entitles us to expect that 
the results of future experiments will be similar, and that conclusions con- 
cerning these future experiments drawn from the mathematical model 
described above, will be correct, or good enough. 

If the model implies that in a particular case the probability of a colony 
arising from more than one independently floating individual is for instance 
P = .001, we may conclude that about 99.9 percent of the colonies were 
produced by one individual only. 

For the sake of clearness I may mention that in the above statement 
“one individual” does not necessarily mean one cell. This expression refers 
to one or more cells that are floating together, being connected either 
mechanically or biologically. 

Example 2.—Table II is reproduced from an article in Biometrika, and 
represents a comparison between the Poisson law, formula (5), and the 
distribution of dodder in samples of clover seed. The problem and the
mathematical model were similar to that treated above. 

The table gives 12 comparisons, of which eleven are based on material 
produced by Schindler and the last by the authors of the article, J. Przyborowski
and H. Wilenski. It will be seen that the material as a whole is

TABLE I
Comparison of distribution of colonies with Poisson Law

[T. Matuszewski, J. Supinska and J. Neyman, Zentralblatt für Bakteriologie,
Parasitenkunde und Infektionskrankheiten. II. Abteilung, 1936, Bd. 95].

Plate 1 Plate 2 Plate 3 Plate 4 Plate 5
k
m' m m' m m' m m' m m' m
0 5 6.1 26 2Tt5 59 55.6 83 75.0 0 Ot 
1 19 18.0 40 A422 86 82.2 134 144.5 5 3.9 
2 26 plied 38 B22. 49 60.8 135 | 139.4 9 11.0 
3 26 26.4 17 law's 30 30.0 101 89.7 23 20.9 
4 21 19.6 5 15 40 43.3 33 29.6 
5 13 Di 2 +9.1 3 |+15.4 16 16.7 32 34.0 
6 4 2 3 32 31.8 
7 f +9.5 2 +7.4 24 25.8 
8 1 2 13 18.3 
9 12 11.6 
10 8 6.7 
11 7 
12 2 +5.7 
$\chi^2$ 0.77 1.61 4.05 3.47 4.94
$P_{\chi^2}$ 0.97 0.66 0.26 0.63 0.84

Plate 6 Plate 7 Plate 8 Plate 9 Plate 10
k
m' m m' m m' m m' m m' m
Culians Bish {%O eel etka aw 3 | a | eon ieate 
1 16 16.2 12 ; 11 10.4 7 8.2 80 75.8 
2 18 19.2 18 16.7 11 1387 14 15.8 45 45.8 
3 15 15.1 13 22.4 alg | 12.0 21 20.2 16 18.5 
4 9 9.0 24 / 2237. 7 7.9 20 19.5 8 
5 4 19 18.3 3 19 15.0 1 +7.3 
6 2 16 WAS. 2 7 9.6 
7 0 +6.7 6 1 +7.1 6 
8 1 4 +13.3 if 1 
9 1 1 0 +9.6 
10 2 
$\chi^2$ 0.30 6.67 Seal 2).63 1.09
$P_{\chi^2}$ 0.97 0.25 0.53 0.85 0.78
k = number of colonies per square.
m' = observed frequency.
m = expected frequency (Poisson).

TABLE II


Comparison of the distribution of dodder seeds in samples of clover with Poisson Law * 


[J. Przyborowski and H. Wilenski, Biometrika, Vol. 27, 1935, p. 277] 

















Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
k
$N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$
0 168 | 183.94 | 599 | 606.53 | 382 | 389.40 | 284 | 303.27 | 795 | 774.64 
1 205 | 188.94 | 315 | 303.27 | 111 97.35 | 170 | 151.63 | 941] 116.20 
2 94 91.97 | 74 75.82 Z 12.17% 39 37.91 11 8.71 
3 26 30.66 | 12 12.64 +1.08 i 6.32 0.45 
4 6 7.66 +1.74 +0.87 
5 1 1953 
Over 5 +0.30 
$\chi^2$ 5.20 0.98 5.00 3.49 5ei3
$P_{\chi^2}$ 0.160 0.600 0.000 0.180 0.000

Sample 6 Sample 7 Sample 8 Sample 9 Sample 10
k
$N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$ $N_k$ $N \cdot P_k$
0 447 | 452.42 | 473 | 475.61 | 295 | 303.27 | 22 16.42 0 1.08 
1 51 45.24 | 26 23.78 | 158 | 151.63 | .29 41.04 3 5.59 
2 2 | 2.26 | +1); +0.61 | 44 3o7.91 | 55 9 es Os) 13.48 
3 +0.08 8 6.32 | 43 42.75 its) 22.46 
4 +0.87 | 34 26.72 | 33 28.07 
5 10 13.36 | 28 28.07 
6 3 5.57 | 24 23.40 
7 4 1.99 |} 21 yall 
8 0.85 | 10 10.44 
9 8 | 5.80 
10 +5 |(+5.10 
$\chi^2$ 0.85 0.47 Deol 8.76 7.04
$P_{\chi^2}$ 0.198 0.319 0.533 0.120 0.532
k = number of dodder seeds in a sample.
$N_k$ = observed frequency.
$N \cdot P_k$ = expected frequency (Poisson).
* Data for the first eleven samples are taken from Schindler’s experiments.

TABLE II—Continued





       Sample 11                     Authors’ own experiment
                                     with known $\lambda = 2$
k          $N_k$   $N \cdot P_k$     k          $N_k$   $N \cdot P_k$
0            0      0.09             0           56     67.67
1            0      0.66             1          156    135.34
2            1      2.49             2          132    135.34
3            4      6.22             3           92     90.22
4            9     11.67             4           37     45.11
5           16     17.60             5           22     18.04
6           19     21.87             6            4      6.02
7           19     23.44             7            0      1.72
8           26     21.97             8            1      0.43
9           19     18.31             Over 8       0      0.12
10          15     13.73
11          14      9.36
12           5      5.85
13           6      3.38
14           3      1.81
15           3      0.90
Over 15      1      0.74
$\chi^2$            9.81                                 8.92
$P_{\chi^2}$       0.548                                0.179


not as satisfactory as in the preceding example. It seems to follow that 
if samples of clover seed are drawn by the method employed by Schindler, 
then conclusions concerning them drawn from the mathematical model 
involving the Poisson Law will not necessarily be very accurate. But it is 
possible that the method of drawing samples of seeds may be so adjusted 
(this is the opinion of Przyborowski and Wilenski) that the number of 
dodders in a small subsample of seeds could be considered rightly as a 
random variable following the Poisson Law. 
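
The comparison in the last column of Table II (the authors' own experiment with known mean 2) can be retraced as follows. This is a sketch and not the authors' procedure: the pooling of all classes with k of 6 or more into one class is an assumption made here so that no expected frequency is very small, and the helper names `poisson` and `chi2_sf_even` are hypothetical. The observed counts are those printed in the table.

```python
from math import exp, factorial

# Observed counts N_k for k = 0..8 and "over 8", last column of Table II.
observed = [56, 156, 132, 92, 37, 22, 4, 0, 1, 0]
lam = 2.0                        # known mean, as stated in the table heading
N = sum(observed)                # total number of samples

def poisson(k, lam):
    """Formula (5): probability of exactly k occurrences."""
    return exp(-lam) * lam**k / factorial(k)

# Expected frequencies N*P_k for k = 0..5, then one pooled class k >= 6.
# The pooling point is an assumption, not something stated in the text.
pool_from = 6
exp_head = [N * poisson(k, lam) for k in range(pool_from)]
expected = exp_head + [N - sum(exp_head)]
obs = observed[:pool_from] + [sum(observed[pool_from:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))

def chi2_sf_even(x, df):
    """Upper tail P(chi-square > x); closed form valid for even df only."""
    half = x / 2
    return exp(-half) * sum(half**j / factorial(j) for j in range(df // 2))

df = len(obs) - 1                # the mean is known, so nothing is estimated
p_value = chi2_sf_even(chi2, df)
print(f"chi2 = {chi2:.2f}, df = {df}, P = {p_value:.3f}")
```

Under these assumptions the printed values should land close to the 8.92 and 0.179 reported in the table; a different pooling rule would change both.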

As mentioned above, if the outcomes of experiments or observations do 
not conform with the predictions of a mathematical model that is strongly 
suggested by intuition, then it is usual to ascribe the divergencies to “faults 
of experimentation.” This expression is vague, and if we try to make it 
more precise, we shall probably come to the description: “The random 
machinery of the observed phenomena does not correspond to the mathe- 
matical model assumed.” The situation can be remedied in two ways. 
One is to make an effort towards a better understanding of the phenomena 
studied and therewith to modify the mathematical model. The other way
is to modify the method of experimentation so as to bring it into con- 
formity with the original mathematical model. The possibility and desir- 
ability of these two methods depend on the circumstances of the problem. 
They are illustrated in the following two examples. 

Example 3.—Problems of pest control led to studies of the distribution 
of larvae in small plots. An experimental field planted with some crop 
is divided into a number of small plots, very much as a Petri-plate in 
Example 1 was divided into small squares. Then all the larvae found in 
each plot are counted. Naturally, the number of larvae varies considerably 
from one plot to another. The original mathematical model of the machin- 
ery behind this variability, the one strongly suggested by intuition, was the 
same as that used for the interpretation of the variability of the number 
of colonies from one square on the Petri-plate to another. Therefore, 
attempts were made to fit the observed distributions with a Poisson fre- 
quency law. Counts of larvae and attempts to understand the machinery 
of their distributions were made by many research workers. Table III, 


TABLE III

Comparison of the distribution of beet web worms with the Poisson and Type A
contagious distributions

[G. Beall, Ecology, Vol. 21, 1940, p. 462]

        Treatment 1 (untreated)      Treatment 2               Treatment 3
Class   Obs.  Poisson  Type A        Obs.  Poisson  Type A     Obs.  Poisson  Type A
              exp.     exp.                exp.     exp.             exp.     exp.


Sab an yg 80.1 116.7 205 | 196.2 203.8 162 | 138.6 157.6 
1 87 | 112.2 84.3 84 99.0 87.8 88 | 118.1 96.0 
2 50 78.5 58.3 30 25.0 25.9 45 50.3 45.4 
3 38 36.7 33.6 4 4.2 6.1 23 14.3. 17.6 
4 21 12.8 17.4 2 0.5 L.2 5 3.0 6.0 
5 7 3.6 8.3 2 0.5 

6 2 0.8 3.7 +0.1 +0.2 

7 2 0.2 1.6 +0;2 +2.4 
8 0 

9 ae etO FL +iel 

$m_1$ 2.114 3.204 2.537
$m_2$ 0.662 0.157 0.336
$\chi^2$ 46.8 = oa 4.0 11 20.2 aoe
$P_{\chi^2}$ 0.000 0.543 0.135 0.282 0.000 0.269

TABLE III—Continued


Comparison of the distribution of diplopods with the Poisson and Type A contagious distributions 
[L. C. Cole, Ecological Monographs, Vol. 16, 1946, p. 71] 


yan: Obs. Poisson | Type A 
en exp. exp. 
0 128i 200.5 133.6 
1 rGl O52) 61.0 
2 34 45.4 30, 0 
3 11 14.4 17:2 
4 8 3.4 fins 
5 5 0.7 art 
Over 5 3 0.1 2.0 
$m_1$ 1.307
$m_2$ 0.712
$\chi^2$ 20.5 4.1
$P_{\chi^2}$ 0.000 0.249


taken from data in papers by Geoffrey Beall,² Lamont C. Cole ³ and S. B.
Fracker and H. A. Brischle,⁴ gives a few observed distributions and their
comparison with theoretical distributions. 

In all cases, the first theoretical distribution tried was that of Poisson. 
It will be seen that the general character of the observed distribution is 
entirely different from that of Poisson. There seems to be no doubt but 
that a very serious divergence exists between the actual phenomenon of 
distribution of larvae and the machinery assumed in the mathematical 
model. When this circumstance was brought to my attention by Dr. Beall, 
we set out to discover the reasons for the divergence. 

From the discussion of Example 1 you will perceive that, if we attempt 
to treat the distribution of larvae from the point of view of Poisson, we 
would have to assume that each larva is placed on the field independently 
of the others. This basic assumption was flatly contradicted by the life 
of larvae as described by Dr. Beall. Larvae develop from eggs laid by 


2 Geoffrey Beall: “The fit and significance of contagious distributions when applied to 
observations on larval insects.” Ecology, Vol. 21 (1940), pp. 460-474. 

3 Lamont C. Cole: “A study of cryptozoa of an Illinois woodland.” Ecological Monographs,
Vol. 16 (1946), pp. 49-86.

4 S. B. Fracker and H. A. Brischle: “Measuring the local distribution of Ribes.”
Ecology, Vol. 25 (1944), pp. 283-303.

TABLE III—Continued


Comparison of distribution of ribes with the Poisson and Type A contagious distributions 
[S. B. Fracker and H. A. Brischle, Ecology, Vol. 25, 1944, p. 291] 





Number per 0.1   Obs.   Poisson   Type A *
acre strip              exp.      exp.
0 42 18.9 42.3 
1 11 23.0 15.6 
2 4 14.0 10.8 
3 1 5.7 5.9 
4 3 ey, 3.0 
5 1 0.4 1.4 
6 0 0.1 
Over 6 2/}l+0.2 +1.0 
$m_1$ 1.000
$m_2$ 1.013
$\chi^2$ 41.7 1.92
$P_{\chi^2}$ 0.000 0.392


* In the original publication the fit given was worse, due to maladjustment of parameters
$m_1$ and $m_2$.


moths. It is plausible to assume that, when a moth feels like laying eggs, 
it does not make any special choice between sections of a field planted with 
the same crop and reasonably uniform in other respects. Therefore, as far 
as the spots where a number of moths lay their eggs is concerned, it is 
plausible that the distribution of spots follows a Poisson Law of frequency, 
depending on just one parameter, say m, representing the average number 
of spots per unit area. 

However, it appears that the moths do not lay eggs one at a time. In 
fact, at each “sitting” a moth lays a whole batch of eggs and the number 
of eggs varies from one cluster to another. Moreover, by the time the counts 
are made the number of larvae is subject to another source of variation, 
due to mortality. 

After hatching in a particular spot, the larvae begin to look for food and 
crawl around. Since the speed of their movements is only moderate, it is 
obvious that for a larva to be found within a plot, the birthplace of this 
larva must be fairly close to this plot. If one larva is found, then it is 
likely that the plot will contain more than one from the same cluster. 
Considerations of this kind were used to build up a mathematical model
of the distribution of larvae which led to the following results. Let
C(k) denote the probability that a plot will contain exactly k larvae, for
k = 0, 1, 2, .... The probability C(0) that there will be no larvae in
the plot considered is computed from the formula

$$C(0) = e^{-m_1(1 - e^{-m_2})}. \tag{7}$$

If C(0), C(1), ..., C(k) are computed, then C(k + 1) is given by the recurrence
formula

$$C(k+1) = \frac{m_1 m_2 e^{-m_2}}{k+1} \sum_{t=0}^{k} \frac{m_2^t}{t!}\, C(k-t). \tag{8}$$

In particular,

$$C(1) = e^{-m_1(1 - e^{-m_2})}\, \frac{m_2}{1!}\, m_1 e^{-m_2}, \tag{9}$$

$$C(2) = e^{-m_1(1 - e^{-m_2})}\, \frac{m_2^2}{2!} \left(m_1^2 e^{-2m_2} + m_1 e^{-m_2}\right), \tag{10}$$

etc.

It may be regretted that the formulae are somewhat complicated. How- 
ever, since the machinery behind the distribution of larvae is rather com- 
plex, one has to put up with the resulting inconvenience. 
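
The inconvenience is mostly one of hand computation. As a hedged illustration, the recurrence can be written as a short Python function. The function name `type_a_probabilities` and the cut-off `k_max` are choices made here, not part of the original publication; the parameter values are the pair listed in Table III for Treatment 1.

```python
from math import exp, factorial

def type_a_probabilities(m1, m2, k_max):
    """Return [C(0), ..., C(k_max)] for the type A contagious distribution."""
    C = [exp(-m1 * (1.0 - exp(-m2)))]                 # formula (7)
    for k in range(k_max):
        # Recurrence (8): C(k+1) is built from C(0), ..., C(k).
        s = sum(m2**t / factorial(t) * C[k - t] for t in range(k + 1))
        C.append(m1 * m2 * exp(-m2) / (k + 1) * s)
    return C

# Parameters listed in Table III for Treatment 1 (beet web worms).
m1, m2 = 2.114, 0.662
C = type_a_probabilities(m1, m2, 60)

# Moments computed directly from the probabilities (tail beyond 60 is negligible here).
mean = sum(k * c for k, c in enumerate(C))
var = sum(k * k * c for k, c in enumerate(C)) - mean**2
print(f"C(0)={C[0]:.4f}  C(1)={C[1]:.4f}  C(2)={C[2]:.4f}")
print(f"mean={mean:.4f}  variance={var:.4f}")
```

Multiplying each C(k) by the number of plots counted gives expected frequencies of the kind tabulated in the "Type A exp." columns of Table III.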

Because, as we have observed, a plot that contains one larva frequently 
contains more than one, the distribution deduced was called “contagious.” 
Several distributions of a similar kind were deduced and, to make a dis- 
tinction, the one given by the above formulae was called contagious of 
type A with two parameters.⁵

A distribution of type A depends on two parameters, $m_1$ and $m_2$, which
are connected with three quantities having a physical meaning as follows.
Assume that the area of the plot on which the larvae are counted is equal
to unity. Further, let m be the average number of batches of eggs per unit
of area, and let $\lambda$ be the average number of survivors per batch of eggs at
the time when the counts are made. Finally, let us introduce an area A
which we shall call “area of accessibility.” Imagine a plot P of unit area 
on which counts of larvae are to be made and let S denote a spot on which 
a batch of eggs was laid. If S is far from P, then no larva hatched at S 
can be found in P. The area A, by definition, contains all points S such 
that larvae born at S can reach the plot P before the counts are made. 


5 The term “contagious distribution” was borrowed from G. Pólya, who was the first
to consider this type of problem. See G. Pólya: “Sur quelques points de la théorie des
probabilités.” Annales de l'Institut Henri Poincaré, Vol. 1 (1931), pp. 117-162.

See also W. Feller: “On a general class of ‘contagious’ distributions.” Annals of Math. 
Stat., Vol. 14 (1943), pp. 389-400. 
Obviously, the more mobile the larvae are, the larger the area A and
conversely. Consequently, if one counts very young larvae, then A is 
small, close to unity. For larger larvae, the area A is larger. It follows 
that a reasonable agreement between theory and observation may be 
expected only if counts include larvae of more or less the same age. 

The parameters $m_1$ and $m_2$ are connected with m, $\lambda$, and A by the following
formulae:

$$m_1 = Am, \qquad m_2 = \frac{\lambda}{A}. \tag{11}$$

The mean number of larvae per plot is

$$\mu_1' = \lambda m = m_1 m_2, \tag{12}$$

the variance is

$$\mu_2 = \lambda m \left(1 + \frac{\lambda}{A}\right) = m_1 m_2 (1 + m_2). \tag{13}$$

It is seen that if the mean $\mu_1'$ is kept constant while the area of accessibility
A is indefinitely increased, then the contagious distribution approaches the
Poisson Law. Details concerning the distribution can be found in the
original publication.⁶ Table III gives the comparison between the observed
distribution of larvae and the one expected on the basis of contagious dis- 
tribution of type A with two parameters. It is seen that in all cases the 
agreement is satisfactory. The data presented do not exhaust the instances 
where contagious distributions of type A fit actual counts of insects. In 
fact, it seems already safe to say that satisfactory agreement between this 
particular mathematical model and observation is a more or less general 
rule with the restriction that the life of the insects concerned does not 
depart too widely from the general scheme described above. On the other 
hand, there are organisms (e.g., scales) whose distribution on units of area 
of their habitat does not conform with type A. An investigation revealed 
that the processes governing the distribution of these organisms were much 
more complex than that described and therefore, if a statistical treatment 
is desired, a fresh effort to construct an appropriate mathematical model is 
necessary. 
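
The remark that the type A distribution approaches the Poisson Law as the area of accessibility A grows can be checked numerically. The sketch below is illustrative only: the values of m and of the mean number of survivors per batch (`lam`) are hypothetical, and `type_a` simply restates formulas (7) and (8).

```python
from math import exp, factorial

def type_a(m1, m2, k_max):
    """Type A probabilities C(0..k_max) from formulas (7) and (8)."""
    C = [exp(-m1 * (1.0 - exp(-m2)))]
    for k in range(k_max):
        s = sum(m2**t / factorial(t) * C[k - t] for t in range(k + 1))
        C.append(m1 * m2 * exp(-m2) / (k + 1) * s)
    return C

def poisson(mean, k_max):
    """Poisson probabilities P_0..P_{k_max} with the given mean."""
    return [exp(-mean) * mean**k / factorial(k) for k in range(k_max + 1)]

m, lam = 1.5, 1.0            # hypothetical: batches per unit area, survivors per batch
mean = lam * m               # formula (12); held fixed while A varies
gaps = {}
for A in (1, 10, 100, 1000):
    m1, m2 = A * m, lam / A                      # formula (11)
    C = type_a(m1, m2, 30)
    P = poisson(mean, 30)
    gaps[A] = max(abs(c - p) for c, p in zip(C, P))
    variance = m1 * m2 * (1 + m2)                # formula (13)
    print(f"A={A:5d}  variance={variance:.4f}  max|C(k)-P_k|={gaps[A]:.5f}")
```

The variance in (13) shrinks toward the mean as A increases, and the largest difference between C(k) and the Poisson term shrinks with it.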

In this example, in order to have agreement between the observed and 
predicted frequencies, it was imperative to adjust the mathematical model. 
This is generally the case when the phenomena studied develop by them- 
selves and do not admit of any sort of human control. In the next example 
we consider an instance of another kind where the experimental technique 
may be so changed as to fit a desirable mathematical model. 


6 J. Neyman: “On a new class of ‘contagious’ distributions, applicable in entomology 
and bacteriology.” Annals of Math. Stat., Vol. 10 (1939), pp. 35-57. 
Example 4.—This example deals with a category of industrial problems.
Problems of this kind are treated by Walter A. Shewhart ⁷ and the reader
will find them of considerable interest. 

Many laboratories are engaged in what is called routine analysis. Small 
quantities of certain materials are sent to the laboratory for determining 
the content of a certain ingredient X. The sample is subdivided into a 
few portions, three, four or sometimes five, and these are analyzed separately.
Denote the particular results by $x_1$, $x_2$, $x_3$ and $x_4$ respectively and
by $\mu$ the “true” content of the ingredient X so that the $x_i$ denote the
measurements of $\mu$.

Because of experimental errors the measurements $x_i$ differ from $\mu$ and
differ among themselves. Frequently there is evidence that the measure- 
ments could be regarded as random variables following a normal law of 
frequency, 





$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}, \tag{14}$$

so that this formula forms the mathematical model of the experiments of
first order. The model may be used to estimate the value of $\mu$ knowing
only the values of four measurements $x_1$, $x_2$, $x_3$ and $x_4$. But we can proceed
differently. Denote by $f_1$ and $f_2$ some two functions of the $x_i$. If the $x_i$
are random variables, then $f_1$ and $f_2$ will also be random variables and we
may consider probabilities of their satisfying any given inequalities. We
may also look for some particular forms of the functions $f_1$ and $f_2$ such
that the probability of their satisfying a given inequality shall be equal
to any given number between zero and unity. Starting from this point of
view it has been found that the functions ⁸








$$f_1 = \bar{x} - \frac{t_\alpha s}{\sqrt{n}} \quad \text{and} \quad f_2 = \bar{x} + \frac{t_\alpha s}{\sqrt{n}} \tag{15}$$

have a remarkable property. Here $\bar{x}$ is the arithmetic mean of the measurements
$x_i$, n their number, s their estimated standard deviation,⁹ and $t_\alpha$ the

7 Walter A. Shewhart: The Economic Control of Quality of Manufactured Product. 
Van Nostrand, New York, 1931, 501 pp. 

8 J. Neyman, “Outline of a theory of statistical estimation based on the classical 
theory of probability.” Phil. Trans. Royal Soc., A236 (1937), pp. 333-380. See also the 
conferences on estimation and confidence intervals. 

9 That is, s is an estimate of $\sigma$; $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$.

value of Fisher’s t corresponding to the number of degrees of freedom on
which s is based, and to $P = 1 - \alpha$ = e.g., .01. If the measurements $x_i$
are independent random variables following the normal law (14), then whatever
be the values of $\mu$ and $\sigma$, the probability of $f_1$ falling short of $\mu$ and of $f_2$
exceeding $\mu$ is exactly equal to $\alpha = .99$.

This circumstance permits the estimation of $\mu$ in the form of a random
experiment. We perform the experimental analysis, obtaining the values of
the $x_i$, and then state that

$$\bar{x} - \frac{t_\alpha s}{\sqrt{n}} \le \mu \le \bar{x} + \frac{t_\alpha s}{\sqrt{n}}. \tag{16}$$

We may be wrong in this statement, but if the $x_i$ do follow law (14), the
probability of our being correct is equal to $\alpha = .99$. In other words, in 99
percent of such experiments, our statement concerning $\mu$ will be correct.

The arbitrarily chosen number $\alpha$ is called the confidence coefficient and the
interval between $f_1$ and $f_2$ the confidence interval. If the number of measurements
is small, something like n = 4, then the value of $t_\alpha$ is considerable, and
the accuracy of estimating $\mu$ as measured by the length of the confidence
interval

$$f_2 - f_1 = \frac{2 t_\alpha s}{\sqrt{n}} \tag{17}$$

is not satisfactory.
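
The claim that the statement about the true content is correct in about 99 percent of such experiments can be illustrated by simulation. This is a sketch under stated assumptions: the measurements are pseudo-random normal values, the true mean `mu` and standard deviation `sigma` are hypothetical, and 5.841 is the standard tabulated value of Student's t for three degrees of freedom at this level.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage(mu=10.0, sigma=0.5, n=4, t_alpha=5.841, runs=20000, seed=1):
    """Fraction of simulated analyses whose interval (f1, f2) contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        x = [rng.gauss(mu, sigma) for _ in range(n)]   # n parallel analyses
        xbar, s = mean(x), stdev(x)                    # stdev uses the n - 1 divisor
        half = t_alpha * s / sqrt(n)                   # half-length of the interval
        if xbar - half <= mu <= xbar + half:
            hits += 1
    return hits / runs

rate = coverage()
print(f"fraction of correct statements: {rate:.4f}")
```

Changing `mu` or `sigma` should leave the fraction essentially unchanged, which is the point of the construction: the probability of a correct statement does not depend on the unknown values.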
In what preceded, the value of $\sigma$ in Equation (14) was considered unknown.
If, however, $\sigma$ is known, then the confidence interval will be written as

$$\bar{x} - T_\alpha \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + T_\alpha \frac{\sigma}{\sqrt{n}} \tag{18}$$

where $T_\alpha$ is the value of $t_\alpha$ corresponding to an infinite number of
degrees of freedom in the estimate of $\sigma$. What this means in practice may be
judged from the following comparison. If $\alpha = .99$, then $T_\alpha = 2.576$, no
matter what $n$ is. At the same time the values of $t_\alpha$ are, respectively,

$$\begin{aligned}
t_{.01} &= 63.657 \quad \text{if } n = 2, \\
t_{.01} &= 9.925 \quad \text{if } n = 3, \\
t_{.01} &= 5.841 \quad \text{if } n = 4, \\
&\text{etc.}
\end{aligned} \tag{19}$$
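
Multipliers of the kind tabulated in (19) can be recomputed directly. The sketch below is an illustration under stated assumptions rather than the method used to build the original tables: it integrates the density of Student's $t$ numerically by Simpson's rule and solves for the quantile by bisection, and the function names, the step length and the search bound `hi` are all choices made here.

```python
import math
from statistics import NormalDist

def t_cdf(x, df):
    """P(t <= x) for x >= 0: one half plus Simpson's rule on [0, x]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2)
    steps = 2 * max(500, int(100 * x))      # even count, step length about 0.005
    h = x / steps
    total = pdf(0.0) + pdf(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3

def t_quantile(p, df, hi=100.0):
    """Bisection for the x with t_cdf(x) = p; assumes 0.5 < p and the root below hi."""
    lo = 0.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.99
p = 1 - (1 - alpha) / 2                     # probability (1 - alpha)/2 left in each tail
T = NormalDist().inv_cdf(p)                 # multiplier when sigma is known
table = {n: t_quantile(p, n - 1) for n in (2, 3, 4)}
for n, t in table.items():
    print(f"n = {n}: t = {t:.3f}, T = {T:.3f}, multiplier ratio t/T = {t / T:.1f}")
```

The ratio `t / T` indicates roughly how much longer the interval based on an estimated standard deviation is than the one based on a known $\sigma$, when $s$ happens to be near $\sigma$.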


It follows that, whenever it is known not only that the analyses made
in some particular laboratory provide numbers $x$ that for practical purposes
could be considered as particular values of a random variable following
the normal probability law (14), but also that the standard deviation
$\sigma$ has permanently some particular numerical value, then the same few
parallel analyses could be used to provide an equally reliable but a much
more accurate statement concerning the value of $\mu$. Therefore, if a
laboratory is permanently engaged in performing analyses of some particular
kind, obviously it must be interested (i) in keeping the value of $\sigma$ constant
over long periods of time; (ii) in estimating this value of $\sigma$ as accurately
as possible; and (iii) in keeping watch over possible changes in $\sigma$.

In order to keep $\sigma$ constant, say throughout a year, it is necessary to
eliminate all factors that may influence the accuracy of the analyses. This
is frequently done; but before trying to estimate the value of $\sigma$ presumed
to be constant, and before applying formula (18) instead of (16) we must
see whether or not the measurements that are being obtained do agree
with the mathematical model involving a constant $\sigma$. Otherwise, repeated
application of formula (18) may give a much greater percentage of errors
than that expected.

This circumstance was realized by J. Przyborowski, who published the
following table illustrating his efforts to stabilize the accuracy of his
analyses of oats. In Table IV, $s_i^2$ is the estimated variance of four parallel
analyses, and $s_0^2$ is the arithmetic mean of a number of such variances
calculated for a long period of time, such as a year or more. If the value
of $\sigma^2$ were actually constant during such a period, then the value of $s_0^2$
would be a very accurate estimate and the mathematical model adopted
would imply a known distribution of the ratio $v = s_i^2/s_0^2$.
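
The known distribution referred to can be made explicit under two assumptions that are labeled here because the text does not spell them out: that $s_i^2$ is computed with divisor $n - 1 = 3$, as for $s^2$ earlier, and that $s_0^2$ is so accurate that it may be replaced by $\sigma^2$. Then $3v$ follows the chi-square law with 3 degrees of freedom, whose distribution function has a closed form. The sketch below computes expected class frequencies; the class limits are hypothetical and are not those of Table IV.

```python
import math

def chi2_3_cdf(x):
    """Distribution function of chi-square with 3 degrees of freedom (closed form)."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def v_cdf(v):
    """P(s_i^2 / sigma^2 <= v) for a variance estimated from four parallel analyses."""
    return chi2_3_cdf(3 * v)

# Hypothetical class limits for v (assumption; not taken from Table IV).
edges = [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, math.inf]
expected = [v_cdf(b) - v_cdf(a) for a, b in zip(edges, edges[1:])]
for (a, b), share in zip(zip(edges, edges[1:]), expected):
    print(f"{a:>5} <= v < {b:<5}: expected share {share:.3f}")
```

Multiplying each share by the number of sets of analyses in a period gives expected frequencies of the kind that can be compared with the observed ones.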

The comparison of the expected and observed frequencies of the values
of $v$ is given in the table for various periods. And here we see the curious
results of efforts to stabilize the accuracy of analyses. Year 1925 is very
bad; 1927 and 1928 show slight improvement, but are still bad. 1929 and
1930 are excellent; but this probably caused a false sense of security of
the personnel, and the next year 1931 is again bad. However, the three
year period 1929-1931 seems to be satisfactory. We may reasonably hope
that the experience of 1931 has stimulated the staff of Professor
Przyborowski’s laboratory and that confidence intervals based on formula (18),
where the value of $\sigma$ is estimated from a great number of previous
experiments, do give correct statements concerning $\mu$ in nearly the expected
percentage of cases, $100\alpha$.

4. Summary. Now let us sum up the main points that I have tried to
emphasize. In speaking about probability, it is necessary to distinguish¹⁰
three different but related aspects of the problem:

(1) a mathematical theory, for example, the one described in my first
lecture;
(2) the frequency of actual occurrences;
(3) the psychological expectation of the participant.


10 Compare with H. Levy and L. Roth, Elements of Probability. Clarendon Press,
Oxford, 1936, p. 15.


The mathematical theory need not be the one I described but, if it is 
mathematically accurate, it will have nothing to do with the outside world 
and, therefore, with either (2) or (3). This is for the good reason that 
an accurate mathematical theory implies accurate definitions and axioms 
and that in the outside world there are no objects that satisfy them except 
within limits “good enough for practical purposes.” 

The theory of probability may be constructed to provide models corre- 
sponding in some sense to certain phenomena of the outside world. And 
here we may distinguish a divergence: (i) Some authors try to provide 
mathematical models of what I called random experiments, the aspect 
falling under (2) above. The theory presented in my first lecture is one 
of the types which comes under this heading. The theory of Richard von 
Mises is another. (ii) In building a mathematical theory of probability 
we may aim at a model of the changes in the state of the human mind 
concerning certain statements that occur as a result of changing the amount 
of known facts. This view is exemplified by the theory built by Harold 
Jeffreys.¹¹ It will be noticed that the theory of probability of my first
lecture has nothing to do with a “state of mind,” although, if we find that 
the probability of a certain property is equal to 0.0001, for example, the 
state of our mind will undoubtedly be influenced by this finding. 

As I have mentioned, any theory may be correct if the authors are suffi- 
ciently accurate in their deductions. However, it is my strong opinion that 
no mathematical theory refers exactly to happenings in the outside world 
and that any application requires a solid bridge over an abyss. The con- 
struction of such a bridge consists first, in explaining in what sense the 
mathematical model provided by the theory is expected to “correspond” to 
certain actual happenings and second, in checking empirically whether or 
not the correspondence is satisfactory. 

The examples which I have given and many others which could easily 
be quoted indicate that, by taking care both in the constructing of a mathe- 
matical model and in the carrying out of the experiments, the bridge between 
the theory of probability sketched in this chapter and certain fields of 
application may be very solid. 


11 See Jeffreys’ Scientific Inference, University Press, Cambridge (Eng.), 1931, 247 pp.
Also numerous papers in the Proceedings of the Royal Society (Series A) and in the
Proceedings of the Cambridge Philosophical Society.

Part 3. Tests of Statistical Hypotheses


1. THE TRADITIONAL PROCEDURE IN TESTING STATISTICAL HYPOTHESES. The 
present lecture should not be considered as a direct continuation of the 
preceding ones which were systematically connected. However, the con- 
cepts discussed in my first two lectures will be used freely and combined 
with a few new ones. Since it would be impossible to give all the necessary 
definitions here, I must assume them to be known. 

The traditional procedure in testing statistical hypotheses is widely 
known but, as it is traditional, opinions concerning its exact nature vary. 
I shall describe here a version that seems to summarize the common phases
in the history of several well known tests, such as the chi-square test for
goodness of fit, Student’s z test and others.

If we had to test any specified (in the early stages, very vaguely
specified) statistical hypothesis $H$ concerning the random variables,

$$x_1, x_2, \ldots, x_n,$$

we used to choose some function $T$ of the $x$’s which, for certain reasons,
seemed to be suitable as a test criterion. Pearson’s chi-square and
Student’s $z$ are instances of such criteria. The next step, and sometimes a
difficult one, consisted in deducing the exact probability law $p(T \mid H)$ or
an approximate one, at least, which the chosen criterion $T$ would follow
if the hypothesis $H$ were true. The graphs of the probability laws
considered usually represented curves with a single maximum at a certain
point of the range, decreasing towards the ends. This suggested a
classification of possible samples into two not very distinctly divided categories,
“probable” and “improbable” samples. If a sample $E$ led to a value of
the criterion $T$ for which the value of $p(T \mid H)$ was small compared with
its maximum, then the sample $E$ would be called improbable, or the
hypothesis $H$ improbable, and conversely. You will certainly remember
instances where both very small and very large values of chi-square are
supposed to suggest that something is wrong.

When an “improbable sample” was obtained, the usual way of reasoning
was this: “Were the hypothesis $H$ true, then the probability of getting a
value of $T$ as or more improbable than that actually observed would be
(e.g.) $P = 0.00001$. It follows that if the hypothesis $H$ be true, what we
actually observed would be a miracle. We don’t believe in miracles
nowadays and therefore we do not believe in $H$ being true.”

The above procedure, or something like it, has been applied since the
invention of the first systematically applied test, the Pearson chi-square of
1900, and has worked, on the whole, satisfactorily.¹ However, now that
we have become sophisticated we desire to have a theory of tests. Above
all, we want to know why we should use this or that particular function $T$
of the $x$’s as a criterion. Why should we test the goodness of fit by
calculating


m— m! 2 
xX =e ihe ie (1) 
m 
and not, say 
m — m’)? 
Pierce (2) 
m 
or
m — m’ 
pile as | | (3) 
or something else? What is the actual meaning of a statistical test? What
is the principle of choosing between several tests suggested for the same
hypothesis? It is the purpose of the present lecture to discuss some of
these questions and to explain certain basic ideas underlying the
contributions to the theory of testing statistical hypotheses for which Professor
E. S. Pearson and myself are responsible.

The first question I shall discuss is this: when selecting a criterion to 
test a particular hypothesis H, should we consider only the hypothesis H, 
or something more? It is known that some statisticians are of the opinion 
that good tests can be devised by taking into consideration only the 
hypothesis tested. But my opinion is that this is impossible and that, if 
satisfactory tests are actually devised without explicit consideration of any- 
thing beyond the hypothesis tested, it is because the respective authors sub- 
consciously take into consideration certain relevant circumstances, namely, 
the alternative hypotheses that may be true if the hypothesis tested is 
wrong. However, it is rather difficult to discuss what an author may have 
in his mind subconsciously, or even consciously. The easier thing is to 
consider the situations which may present themselves when we are forced 


to select a test for a particular hypothesis $H$ with nothing to base our device
on except the hypothesis itself.


1 Since the publication of Lectures and Conferences in 1938, I have found that the
first exact test of a statistical hypothesis was devised much earlier. In fact, this honor
seems to belong to Laplace. In his paper, “Mémoire sur l’inclinaison moyenne des
orbites des comètes,” Mémoires de l’Académie royale des Sciences de Paris, Vol. VII,
1773 (see also Oeuvres complètes de Laplace, t. 8, Paris, 1891, pp. 279-321), Laplace
deduced a test based on the exact distribution of the mean of a sample drawn from a
“rectangular” distribution. Most readers of this book will be familiar with the fact that,
when the sample size $n$ is not too small, this distribution is very close to normal.
Laplace gives the exact formula for the distribution and illustrates it on diagrams
corresponding to several values of $n$. Curiously, while his formulae are correct, the diagrams
are wrong and bear no resemblance to the normal law!

Suppose then that we have to test some hypothesis $H$, and that two
different criteria $T_1$ and $T_2$ are suggested. Which of them should we use?
What circumstances, referring to $H$ and to nothing else, should influence
our choice? I cannot think of all the suggestions that have been made,
but I do remember seeing opinions that the criterion with the smaller
standard deviation would be preferable.

Let us generalize this suggestion and consider more closely the tentative
principle that the choice between possible criteria should be made on
properties of their distributions as determined by $H$. This principle, call it
Principle I, would obviously cover the question of the relative size of the
standard deviations.

With regard to Principle I, I shall show that it is not sufficient for the 
choice. In fact, I shall prove that there may be two criteria having the 
following properties: 

(i) Both have identical frequency distributions; and therefore, on the 
basis of Principle I alone, it will be impossible to choose between them. 

(ii) Whenever one of these criteria has the most “improbable” values,
thus “disproving” the hypothesis tested, the values of the other are just 
the most “probable” ones. This last circumstance will make it necessary 
to choose one of the criteria. 

With the above situation in view, I shall mention another principle, to 
be called Principle II, which has been suggested by certain eminent workers 
in theoretical statistics: whenever you have two (or more) criteria, choose 
the one which, on the sample obtained, is less favorable to the hypothesis 
you test. 

This principle implies, of course, that criteria could, and should, be 
chosen after the sample is drawn and analyzed. 

I shall show that, if this principle is adopted, then it is useless to make 
any calculations with a view to testing hypotheses: given a certain amount 
of mathematical skill we shall be able to “disprove” any hypothesis on 
any sample. 

The above two principles do not exhaust all the possibilities. There may
be other principles that do not go beyond consideration of the hypothesis
tested. For example, we may require of the functions $T$ used as criteria
some particular properties, e.g., that they should be symmetrical with
respect to the random variables, etc. However, I cannot think of any such
limitation that would seem reasonable. Therefore, without claiming that
the two propositions which I am going to prove provide decisive evidence
that it is absolutely impossible to make a rational choice of criteria without
explicitly or tacitly considering hypotheses alternative to the one that is
being tested, I am inclined to think that this conclusion is highly probable.
Anyhow, the two propositions do cover a certain range of possibilities and
clear away certain popular misconceptions. They show, for instance, that
an argument like “use $T_1$ rather than $T_2$ because its standard error is
smaller” is not convincing. Let us now enter into details.

2. INSUFFICIENCY OF PRINCIPLE I. Consider a system of $n$ random
variables, $x_1, x_2, \ldots, x_n$, known to be independent and following the normal law

$$p_X(x_1, \ldots, x_n) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\sum (x_i - \mu)^2 / 2\sigma^2} \tag{4}$$

where $\sigma > 0$ and $\mu$ are unknown constants. Suppose it is desired to test the
hypothesis $H$ that $\mu = 0$. This is known as Student’s hypothesis. The
generally accepted criterion to test $H$ is the one invented by Student, namely,
to calculate

$$z = \frac{\bar{x}}{s} \tag{5}$$

where

$$\bar{x} = \frac{1}{n} \sum x_i, \qquad n s^2 = \sum (x_i - \bar{x})^2. \tag{6}$$

The probability law of $z$, if the hypothesis $H$ be true, is given by

$$p_z(z) = C (1 + z^2)^{-n/2}, \tag{7}$$

where

$$C^{-1} = \int_{-\infty}^{+\infty} (1 + z^2)^{-n/2}\, dz = B\left(\tfrac{1}{2}(n - 1), \tfrac{1}{2}\right). \tag{8}$$

The hypothesis $H$ is to be rejected whenever the value $|z'|$ of $|z|$ calculated
for the sample is so large that

$$P\{|z| \ge |z'|\} = 2 \int_{|z'|}^{\infty} p_z(z)\, dz \tag{9}$$

is considered “small.”

To prove the insufficiency of the Principle I as explained above I shall
now define another criterion, depending on the quantity $\zeta$, which will have
the following properties:

1. If $H$ be true, then the probability law of $\zeta$ is identical with that of $z$, so
that

$$p_\zeta(\zeta) = C (1 + \zeta^2)^{-n/2}. \tag{10}$$

2. The absolute value of the product $|z\zeta|$ cannot exceed unity, i.e.,

$$|z\zeta| \le 1. \tag{11}$$

If the $\zeta$ criterion were used to test $H$, then this hypothesis would be
rejected whenever $|\zeta|$ is large. In fact the large values of $|\zeta|$ are
“improbable” whenever $H$ is true. From (11) it follows that whenever $|\zeta|$ is large
then $|z|$ must be small and conversely. Thus, whenever one of the
alternative criteria $z$ and $\zeta$ indicates that the hypothesis $H$ should be rejected, the
other is bound to protest that there is no reason for such rejection. This
means that whenever one of the criteria has a large absolute value, we
are compelled to choose the one whose verdict we shall respect. Principle I
will not help us in the choice, because the probability laws of $z$ and $\zeta$ are
identical. This completes the proof of the insufficiency of Principle I.

In order to define $\zeta$ let us assume that the $x_i$ are numbered in the order
in which they are given by observation. Let

$$\bar{x}' = \frac{x_1 - x_2}{\sqrt{2n}} \tag{12}$$

and

$$n s'^2 = \sum_{1}^{n} x_i^2 - n \bar{x}'^2 = \tfrac{1}{2}(x_1 + x_2)^2 + \sum_{3}^{n} x_i^2. \tag{13}$$

The functions $\bar{x}'$ and $s'$ thus defined will be called the quasi mean and the
quasi standard deviation of the $x_i$. Now I shall prove Proposition a, namely
that the ratio

$$\zeta = \frac{\bar{x}'}{s'} \tag{14}$$

has the properties 1 and 2 described above.
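
Before the proof, both properties can be checked numerically. The sketch below is only an illustration and proves nothing: the sample size, the number of trials, the cut-off and the function names are assumptions made here. It draws normal samples with $\mu = 0$, computes Student's $z$ from (5) and (6) and the ratio $\zeta$ from (12) to (14), and records the largest product $|z\zeta|$ together with the frequencies with which each criterion is large.

```python
import math
import random

def student_z(xs):
    """Student's z = mean / s, with n * s^2 = sum (x_i - mean)^2, as in (5)-(6)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return mean / s

def quasi_zeta(xs):
    """zeta = quasi mean / quasi standard deviation, following (12)-(14)."""
    n = len(xs)
    quasi_mean = (xs[0] - xs[1]) / math.sqrt(2 * n)
    quasi_var = (0.5 * (xs[0] + xs[1]) ** 2 + sum(x * x for x in xs[2:])) / n
    return quasi_mean / math.sqrt(quasi_var)

rng = random.Random(1)
n, trials = 5, 20000                       # hypothetical sample size and trial count
pairs = []
for _ in range(trials):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # hypothesis H true: mu = 0
    pairs.append((student_z(xs), quasi_zeta(xs)))

# Property 2: the product never exceeds unity (up to rounding).
largest_product = max(abs(z * zeta) for z, zeta in pairs)

# Property 1: identical laws, so tail frequencies agree up to sampling noise.
cut = 1.0
tail_z = sum(abs(z) >= cut for z, _ in pairs) / trials
tail_zeta = sum(abs(zeta) >= cut for _, zeta in pairs) / trials
print(largest_product, tail_z, tail_zeta)
```

The two tail frequencies come out nearly equal although, because of (11), the two criteria are essentially never large on the same sample.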

In order to prove 1, it is sufficient to show that the simultaneous probability
law of $\bar{x}'$ and $s'$ is identical with that of the ordinary mean $\bar{x}$ and standard
deviation $s$.

If the hypothesis $H$ be true, then $\mu = 0$ and

$$p_X(x_1, \ldots, x_n) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-\sum x_i^2 / 2\sigma^2}. \tag{15}$$

Let us introduce a new system of random variables, $y_1, y_2, \ldots, y_n$,
connected with the $x_i$ by the following formulas,

$$\begin{aligned}
x_1 &= \frac{1}{\sqrt{2}}\, y_1 + \frac{1}{\sqrt{2}}\, y_2, \\
x_2 &= -\frac{1}{\sqrt{2}}\, y_1 + \frac{1}{\sqrt{2}}\, y_2, \\
x_i &= y_i \quad \text{for } i = 3, 4, \ldots, n.
\end{aligned} \tag{16}$$


It will be noticed that

$$\frac{y_1}{\sqrt{n}} = \frac{x_1 - x_2}{\sqrt{2n}} = \bar{x}' \tag{17}$$

and is therefore identical with the quasi mean defined in equation (12). We
shall return to this notation after a while. Furthermore,

$$y_2 = \frac{x_1 + x_2}{\sqrt{2}} \tag{18}$$

and having regard to (13) we shall have

$$s'^2 = \frac{1}{n} (y_2^2 + y_3^2 + \cdots + y_n^2). \tag{19}$$


The probability law of the $y_i$ will be deduced from equation (15) following
the steps indicated in my first lecture, namely,

$$p_Y(y_1, y_2, \ldots, y_n) = p_X(x_1, x_2, \ldots, x_n)\, |\Delta| \quad \text{(Eq. 25, page 21)} \tag{20}$$

where $|\Delta|$ is the Jacobian defined by equation (24) of page 21, and the $x_i$
on the right-hand side should be expressed in terms of the $y_i$. Easy
calculations give

$$p_Y(y_1, y_2, \ldots, y_n) = p_Y(\sqrt{n}\, \bar{x}', y_2, \ldots, y_n) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n e^{-n(\bar{x}'^2 + s'^2)/2\sigma^2} \tag{21}$$


where $s'^2$ stands for the sum of squares (19). Our next step consists in
introducing still another system of variables, $u_1, u_2, \ldots, u_n$, one of which will be
identical with $\bar{x}'$ and another with $s'$. We put

$$\begin{aligned}
y_1 &= \sqrt{n}\, u_1, \\
y_2 &= \sqrt{n}\, u_2 \cos u_n \cos u_{n-1} \cdots \cos u_4 \cos u_3, \\
y_3 &= \sqrt{n}\, u_2 \cos u_n \cos u_{n-1} \cdots \cos u_4 \sin u_3, \\
y_4 &= \sqrt{n}\, u_2 \cos u_n \cos u_{n-1} \cdots \sin u_4, \\
&\;\;\vdots \\
y_n &= \sqrt{n}\, u_2 \sin u_n.
\end{aligned} \tag{22}$$

The range of variation of the new variables is determined by the following
inequalities

$$0 < u_2, \qquad 0 \le u_3 < 2\pi, \qquad -\tfrac{1}{2}\pi < u_i < +\tfrac{1}{2}\pi, \quad i = 4, 5, \ldots, n, \tag{23}$$

wherefore outside these limits the probability law of the $u_i$ is identically equal
to zero.
It will be easily seen that

$$u_2^2 = \frac{1}{n} (y_2^2 + y_3^2 + \cdots + y_n^2) \tag{24}$$

and later on we shall drop the notation $u_1$ and $u_2$, substituting for them $\bar{x}'$
and $s'$ respectively. Easy calculations give for the Jacobian

$$\left| \frac{\partial(y_1, y_2, \ldots, y_n)}{\partial(u_1, u_2, \ldots, u_n)} \right| = (\sqrt{n})^n\, u_2^{n-2} \cos u_4 \cos^2 u_5 \cos^3 u_6 \cdots \cos^{n-3} u_n, \tag{25}$$

and it follows that

$$p_U(u_1, u_2, \ldots, u_n) = \left(\frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\right)^n u_2^{n-2}\, e^{-n(u_1^2 + u_2^2)/2\sigma^2} \cos u_4 \cos^2 u_5 \cdots \cos^{n-3} u_n. \tag{26}$$


In order to obtain the simultaneous probability law of $u_1$ and $u_2$ or, what
comes to the same thing, of $\bar{x}'$ and $s'$, we must integrate (26) for $u_3, u_4, \ldots, u_n$
from $-\infty$ to $+\infty$. Since the integrand differs from zero only within the limits
shown in (23), and since these limits for $u_3, u_4, \ldots, u_n$ do not depend on the
values of $u_1$ and $u_2$, we have at once that

$$p(u_1, u_2) = C_1 \left(\frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\right)^n u_2^{n-2}\, e^{-n(u_1^2 + u_2^2)/2\sigma^2} \tag{27}$$

wherein

$$C_1 = \int \cdots \int_w \cos u_4 \cos^2 u_5 \cdots \cos^{n-3} u_n \, du_3\, du_4\, du_5 \cdots du_n, \tag{28}$$

and the region of integration, $w$, is determined by

$$0 \le u_3 < 2\pi, \qquad -\tfrac{1}{2}\pi < u_i < +\tfrac{1}{2}\pi \quad \text{for } i = 4, 5, \ldots, n. \tag{29}$$

Remembering that $u_1$ and $u_2$ are identical with $\bar{x}'$ and $s'$ respectively, we
have then

$$p(\bar{x}', s') = C_1 \left(\frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\right)^n s'^{\,n-2}\, e^{-n(\bar{x}'^2 + s'^2)/2\sigma^2}. \tag{30}$$

We see that the quasi mean and the quasi standard deviation as defined by
(12) and (13) do follow a probability law identical with that of the ordinary
mean $\bar{x}$ and standard deviation $s$ of the $x_i$. In order to obtain the probability
law of the ratio $\zeta$ we must now perform on equation (30) exactly the same
operations that lead to the probability law of Student’s $z$; and it is obvious
that the probability law of $\zeta$ will be found to be identical with that of $z$. This
proves the first part of the proposition.

Let us now prove part 2, namely, that $|z\zeta| \le 1$. For this purpose notice
that, whatever be the real numbers $a$ and $b$, we shall have

$$(a \pm b)^2 = a^2 \pm 2ab + b^2 \ge 0 \tag{31}$$

and therefore

$$2|ab| \le a^2 + b^2. \tag{32}$$

It follows that for any real numbers $a$ and $b$,

$$(a + b)^2 \le 2(a^2 + b^2). \tag{33}$$

If $s$ is the ordinary standard deviation of the $x_i$ and $\bar{x}$ their mean, then

$$n s^2 = \sum (x_i - \bar{x})^2 \ge (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2. \tag{34}$$

On the other hand the definition of the quasi mean gives us

$$2n \bar{x}'^2 = (x_1 - x_2)^2 = [(x_1 - \bar{x}) - (x_2 - \bar{x})]^2 \tag{35}$$

and, from (33), we see that

$$2n \bar{x}'^2 \le 2[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2]. \tag{36}$$

Comparing (34) and (36) we find that

$$\bar{x}'^2 \le s^2, \tag{37}$$

an inequality between the squares of the quasi mean and of the ordinary
standard deviation. From the definition of the quasi standard deviation (13) it
follows that

$$\sum x_i^2 = n(s'^2 + \bar{x}'^2) = n(s^2 + \bar{x}^2). \tag{38}$$

Therefore

$$s'^2 - \bar{x}^2 = s^2 - \bar{x}'^2 \tag{39}$$

and, owing to (37),

$$\bar{x}^2 \le s'^2. \tag{40}$$

Multiplying (37) and (40) and dividing the resulting inequality by the
product $s^2 s'^2$, we get

$$\left(\frac{\bar{x}}{s}\right)^2 \left(\frac{\bar{x}'}{s'}\right)^2 \le 1 \tag{41}$$

which is equivalent to $|z\zeta| \le 1$, or equation (11) of page 46. This completes
the proof of part 2 of Proposition a. Thus we have shown that Principle I
by itself is not sufficient for a choice between alternative criteria that may be
suggested for testing a given hypothesis.

3. CONSEQUENCES OF SUPPLEMENTING PRINCIPLE I BY PRINCIPLE II. We
shall now show that Principle I could not usefully be supplemented by
Principle II. The combination of the two principles would read as follows:
if there are several criteria for testing a given hypothesis $H$, all following
the same probability law as determined by $H$, then the choice among them
should be made after the sample is drawn and examined, and we should
choose the test that appears to be the least favorable to $H$. We have already
seen that if Student’s hypothesis (page 46) be true, then Student’s $z$ is not
the only function of the $x_i$ following the familiar probability law (7). We
shall now show that, whatever be the sample $E'$ observed in a particular
case, not all the $x_i$ being equal to zero, it is possible to find a criterion, say
$\zeta^\circ$, which for this particular sample possesses the value $+\infty$ and which,
in repeated sampling, follows exactly the same law as $z$ and $\zeta$ discussed
above. If we adopt both Principle I and Principle II, then we shall have
to test Student’s hypothesis using $\zeta^\circ$; and this will lead to the rejection of
the hypothesis. Thus in all cases, with the sole exception that all observed
$x_i$ are equal to zero, Student’s hypothesis will have to be rejected, which
shows that the combination of the two principles I and II is not a
reasonable solution of the difficulty.

I shall now call the attention of the reader to the distinction between
$x_i'$ and $x_i$ used below. The symbol $x_i$ will mean, as before, the random
variable following the law (15). On the other hand $x_i'$ will denote a value
of $x_i$ observed in some particular case.

Proposition b. Whatever be the sample

$$E' = (x_1', x_2', \ldots, x_n') \tag{42}$$

observed in a particular case, one at least of the $x_i'$ being different from zero,
it is possible to define a criterion $\zeta^\circ$ which is represented by a function of the
$x_i$ and which has the following properties:

(i) The probability law of $\zeta^\circ$, as determined by $H$, is the same as that of
Student’s $z$ and that of $\zeta$, equation (7), page 46.

(ii) The value $\zeta^\circ(E')$ of $\zeta^\circ$, calculated for the sample $E'$, is infinite.

It will be noticed that $\zeta^\circ$ will have to be adjusted to the sample $E'$ already
observed. Therefore the values (42) will have to enter into the expression
of $\zeta^\circ$. They are constant numbers and will play the role of coefficients. On
the other hand, $\zeta^\circ$ will depend also on the random variables $x_i$.

Proof of part (i) of Proposition b. Since the order in which the $x_i$ are
numbered is of no consequence, we may assume that $x_1', x_2', \ldots, x_m'$ are different
from zero, $m \le n$. Before defining $\zeta^\circ$ we shall need the numbers $a_1, a_2, \ldots,
a_n$, which are connected with the $x_i'$ by the $n$ equations

$$a_i = \frac{x_i'}{\sqrt{x_1'^2 + x_2'^2 + \cdots + x_n'^2}}, \quad i = 1, 2, \ldots, n. \tag{43}$$

Obviously $a_i \ne 0$ for $i = 1, 2, \ldots, m$, but $a_i = 0$ for $i = m + 1, \ldots, n$; also

$$\sum a_i^2 = 1. \tag{44}$$

Further steps consist first in defining a “pseudo mean” $\bar{x}''$ and a “pseudo
S.D.” $s''$ and then in making the identification

$$\zeta^\circ = \frac{\bar{x}''}{s''}. \tag{45}$$

Here the pseudo mean and pseudo S.D. are defined by

$$\bar{x}'' = \frac{a_1 x_1 + a_2 x_2 + \cdots + a_n x_n}{\sqrt{n}} \tag{46}$$

and

$$s''^2 = \frac{1}{n} \sum x_i^2 - \bar{x}''^2. \tag{47}$$

It will be noticed that if $a_i = 1/\sqrt{n}$ for $i = 1, 2, \ldots, n$, then the pseudo
mean and pseudo S.D. become identical with the ordinary ones, $\bar{x}$ and $s$.
It will be sufficient to show the existence of a system of variables

$$v_1, v_2, \ldots, v_n,$$

whose elementary probability law as determined by $H$ is

$$p_V(v_1, v_2, \ldots, v_n) = \frac{\sqrt{n}}{(\sigma\sqrt{2\pi})^n}\, e^{-n(v_1^2 + s''^2)/2\sigma^2}, \tag{48}$$

wherein

$$v_1 = \bar{x}'' \quad \text{and} \quad n s''^2 = (v_2^2 + \cdots + v_n^2). \tag{49}$$

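To show that $v_1, \ldots, v_n$ exist and that they possess the probability law (48)
we introduce

$$\beta_k = a_k \left[ (a_1^2 + \cdots + a_{k-1}^2)(a_1^2 + \cdots + a_k^2) \right]^{-1/2} \quad \text{for } k = 2, 3, \ldots, m \quad \text{and} \quad \beta_k = 0 \text{ for } k > m. \tag{50}$$

Now, we relate $v_1, v_2, \ldots, v_n$ to $x_1, x_2, \ldots, x_n$ by the following system of
equations:

$$\begin{aligned}
x_1 &= \sqrt{n}\, a_1 v_1 + a_1 \beta_2 v_2 + a_1(\beta_3 v_3 + \beta_4 v_4 + \cdots + \beta_m v_m), \\
x_2 &= \sqrt{n}\, a_2 v_1 - \frac{a_1^2}{a_2}\, \beta_2 v_2 + a_2 \beta_3 v_3 + a_2(\beta_4 v_4 + \beta_5 v_5 + \cdots + \beta_m v_m), \\
x_3 &= \sqrt{n}\, a_3 v_1 - \frac{a_1^2 + a_2^2}{a_3}\, \beta_3 v_3 + a_3 \beta_4 v_4 + a_3(\beta_5 v_5 + \cdots + \beta_m v_m), \\
&\;\;\vdots \\
x_k &= \sqrt{n}\, a_k v_1 - \frac{a_1^2 + \cdots + a_{k-1}^2}{a_k}\, \beta_k v_k + a_k(\beta_{k+1} v_{k+1} + \cdots + \beta_m v_m)
\end{aligned} \tag{51}$$

for $k = 2, 3, \ldots, m$. In interpreting these equations, it is important to
remark that, owing to the definition of $m$ and $\beta_k$, if $m < n$, then

$$\beta_{m+1} = \beta_{m+2} = \cdots = \beta_n = 0. \tag{52}$$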


If $m = n$, then equations (51) define the transformation completely.
Otherwise, if $m < n$, we put

$$x_i = v_i \quad \text{for } i = m + 1, \ldots, n. \tag{53}$$

With some algebraic reduction and the fact that $a_1^2 + \cdots + a_n^2 = 1$ (equation
44), it will be found that

$$v_1 = \frac{1}{\sqrt{n}} (a_1 x_1 + \cdots + a_n x_n) = \bar{x}'' \tag{54}$$

and that

$$(x_1^2 + \cdots + x_n^2) = n v_1^2 + (v_2^2 + \cdots + v_n^2) = n v_1^2 + n s''^2. \tag{55}$$

The Jacobian $|\Delta| = \left| \partial(x_1, \ldots, x_n) / \partial(v_1, \ldots, v_n) \right| = \sqrt{n}$, as is not
difficult to work out from equations (51), (52), (53). From equation (55), and
the value of the Jacobian, it follows by applying equation (25) of page 21 that
if equation (15) is the simultaneous elementary probability law of $x_1, x_2, \ldots,
x_n$, then that of $v_1, v_2, \ldots, v_n$ must be as written in equation (48).

Since equation (48) is of the same form as equation (21), and since formula
(45) is similar to (14), it is clear that the steps required to deduce $p(\zeta^\circ)$ from
(48) would be identical with those already shown in the deduction of $p(\zeta)$
from (21). This completes the proof that the criterion $\zeta^\circ$ has the property
(i).

Proof of part (ii) of Proposition b. We must now prove the other statement
(ii) on page 51 concerning $\zeta^\circ$; namely, we must prove that if in the expression
for $\zeta^\circ$ we substitute, instead of the random variables $x_i$, the particular observed
values $x_i'$ of (42) in terms of which the function $\zeta^\circ$ has been defined, then the
value $\zeta^\circ(E')$ of $\zeta^\circ$ will be found infinite. Replacing $x_i$ by $x_i'$ in equation (46),
and remembering that the coefficients $a_i$ therein have already been defined by
equation (43) in terms of the $x_i'$, we easily find that the value of the pseudo
mean calculated for the sample $E'$ is

$$\bar{x}''(E') = \sqrt{\frac{x_1'^2 + x_2'^2 + \cdots + x_n'^2}{n}} > 0 \tag{56}$$

because at least one of the numbers $x_i'$ is different from zero. Further,
substituting $x_i'$ for $x_i$ in equation (47) to calculate the pseudo S.D. $s''(E')$, we
find it to be zero. It follows from equation (45) that

$$\zeta^\circ(E') = \frac{\bar{x}''(E')}{s''(E')} = \infty \tag{57}$$

and this completes the proof of part (ii) of Proposition b.

For the one particular sample $E'$ already drawn, $\zeta^\circ$ has the value $\infty$, but
in repeated sampling it follows the same law as $z$ and $\zeta$.
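
The construction of Proposition b can be tried out numerically. The sketch below is an illustration under assumptions, not part of the original argument: the observed sample, the function names and the rounding tolerance used to recognize a vanishing pseudo S.D. are all chosen here. It builds the criterion from equations (43), (46) and (47), evaluates it on the sample it was tailored to, and then looks at its behavior on fresh samples drawn with $\mu = 0$.

```python
import math
import random

def make_zeta0(observed):
    """Build the criterion tailored to the already observed sample E'."""
    norm = math.sqrt(sum(x * x for x in observed))
    if norm == 0:
        raise ValueError("all observed values are zero; Proposition b excludes this case")
    a = [x / norm for x in observed]                     # coefficients a_i of (43)

    def zeta0(xs):
        n = len(xs)
        pseudo_mean = sum(ai * xi for ai, xi in zip(a, xs)) / math.sqrt(n)   # (46)
        pseudo_var = sum(x * x for x in xs) / n - pseudo_mean ** 2           # (47)
        # Assumption: treat a pseudo variance that is zero up to rounding as zero.
        if pseudo_var <= 1e-12 * pseudo_mean ** 2:
            return math.copysign(math.inf, pseudo_mean)
        return pseudo_mean / math.sqrt(pseudo_var)

    return zeta0

observed = [0.3, -1.2, 0.7, 2.1, -0.4]                   # hypothetical sample E'
zeta0 = make_zeta0(observed)
value_on_observed = zeta0(observed)                      # the pseudo S.D. vanishes here

rng = random.Random(2)
fresh = [zeta0([rng.gauss(0.0, 1.0) for _ in observed]) for _ in range(20000)]
tail = sum(abs(v) >= 1.0 for v in fresh) / len(fresh)    # compare with Student's z
print(value_on_observed, tail)
```

On fresh samples the tail frequency is close to that of Student's $z$ for the same $n$, while on the sample used to build the criterion the value is infinite. This is why choosing the criterion after seeing the sample would "disprove" the hypothesis on any sample.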

It may be useful here to make the following remark. No number of 
examples is able to provide a proof of a general statement. On the other 
hand, the failure of a single example is sufficient to disprove any general 
statement. Our purpose here was to show that the principles I and II 
could not generally be applied for making a choice among criteria for 
testing hypotheses, and the validity of the proof does not suffer from the 
fact that we have limited ourselves to the consideration of one particular 
example. 

As a matter of fact, it is easily seen how the above reasoning could be 
generalized, but such generalization would not produce any new relevant 
result. 

4. GENERAL BASIS OF THE THEORY OF TESTING STATISTICAL HYPOTHESES. I
shall finish this lecture by indicating what appears to be the general basis
of the theory of testing statistical hypotheses. We must start by
considering the situation in its most general form.

(i) When we desire to test a particular statistical hypothesis, $H_0$, we
imply that it may be wrong. E.g., if we try to test Student’s hypothesis
that $\mu = 0$, we admit the possibility that it may be wrong and that,
therefore, $\mu$ may have some value other than zero. It will be seen that
whenever we attempt to test a hypothesis we do admit, although perhaps
subconsciously, that there are hypotheses that are contradictory or, in our
terminology, alternative, to the one tested. There is no reason why these
alternative hypotheses should not be considered explicitly when choosing
an appropriate test.

(ii) Whenever we attempt to test a hypothesis we naturally try to avoid
errors in judging it. This seems to indicate the right way of proceeding:
when choosing a test we should try to minimize the frequency of errors that
may be committed in applying this test.

Having in mind the above two points (i) and (ii) we may proceed further
and discuss the kinds of errors we may commit in testing any given
hypothesis $H_0$. It is easy to see that there are two kinds:


After having applied a test we may decide to reject the hypothesis $H_0$,
when in fact, though we do not know it, it is actually true. This is
called an error of the first kind.

After having applied a test we may decide not to reject the hypothesis
$H_0$ (this may be described in short by saying that we “accept $H_0$”)
when in fact $H_0$ is wrong, and therefore some alternative hypothesis
$H'$ is true. This is called an error of the second kind.


The test adopted should control both kinds of errors. Now let us see 
what essentially is the machinery of any test, whatever be the principle on 
which it was chosen. 

A test is nothing but a rule by which we sometimes reject the hypothesis
tested and sometimes accept it (in the sense explained above), according
to whether or not the observations available possess some properties
specified by the rule. The observations are some $n$ numbers, $x_1, x_2, \ldots, x_n$, the
system of which could be represented by a point $E$ in the $n$-dimensioned
space $W$, having the $x_i$ for the $n$ coordinates. The point $E$ and the space
$W$ are called the sample point and the sample space. Any rule specifying
cases where we should reject the hypothesis tested is equivalent to a
specification of the positions of $E$ within $W$ which, if arrived at by observation,
lead to a rejection of $H_0$. These positions usually fill up a certain region,
$w$, which is called the critical region or the region of rejection.

In conclusion we see that to choose a test for a statistical hypothesis
$H_0$ we must choose a critical region $w$ in the sample space $W$ and make
a rule of rejecting $H_0$ whenever $E$, as determined by observation, falls
within $w$.

Let us illustrate this by an example. Consider the case where a sampled
population is divided into $n$ categories and we test the hypothesis that the
probability of an individual falling within the $i$th category has some
specified value $p_i$ for $i = 1, 2, \cdots, n$. Denote by $M$ the total number of
observations and by $m_i$ the number of observations belonging to the $i$th
category.

The generally accepted test of this hypothesis consists in rejecting it
whenever

$$\chi^2 = \sum_{i=1}^{n} \frac{(m_i - Mp_i)^2}{Mp_i} \tag{58}$$

is “too large.” What “too large” means is a subjective question, but there
must be a more or less definite limit between values of chi-square that are
“too large” and others that are not. Let $\chi_0^2$ denote this limit, and consider
a space of $n - 1$ dimensions, the coordinates of any point being
$m_1, m_2, \cdots, m_{n-1}$. As none of the $m_i$ can be negative and their sum cannot exceed $M$,
the sample space $W$ will be composed of points $E$ with all coordinates
$m_1, m_2, \cdots, m_{n-1}$ being non-negative integers and satisfying the inequality

$$m_1 + m_2 + \cdots + m_{n-1} \le M. \tag{59}$$

It is easily seen that the rule of rejecting $H_0$ whenever $\chi^2 > \chi_0^2$ is equivalent
to considering the region $w$ lying within $W$ and outside the ellipsoid

$$\sum_{i=1}^{n} \frac{(m_i - Mp_i)^2}{Mp_i} = \chi_0^2, \qquad m_n = M - \sum_{i=1}^{n-1} m_i, \tag{60}$$

as the critical region.
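The rule just described can be sketched in a few lines of Python. The category probabilities, the counts and the limit 5.991 (the customary 5 per cent point of chi-square with 2 degrees of freedom) are assumptions chosen only for illustration; none of them comes from the lecture.

```python
def chi_square(counts, probs):
    """Formula (58): sum over categories of (m_i - M p_i)^2 / (M p_i)."""
    total = sum(counts)  # M, the total number of observations
    return sum((m - total * p) ** 2 / (total * p) for m, p in zip(counts, probs))


def in_critical_region(counts, probs, limit):
    """True when the sample point lies outside the ellipsoid (60)."""
    return chi_square(counts, probs) > limit


# Hypothetical case: n = 3 categories and M = 60 observations.
probs = [0.5, 0.3, 0.2]
limit = 5.991  # assumed limit between "too large" and other values

print(in_critical_region([30, 18, 12], probs, limit))  # counts equal to M p_i
print(in_critical_region([45, 10, 5], probs, limit))   # counts far from M p_i
```

Only the first $n - 1$ counts are free coordinates, since the last one is fixed by $M$; the function receives all $n$ counts merely for convenience.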

It is equally easy to see that any other test has a similar feature. For
example, Student’s test is equivalent to a rule of rejecting Student’s hypothesis
whenever the sample point falls within a circular hypercone with the axis

$$x_1 = x_2 = \cdots = x_n. \tag{61}$$


Having disposed of this we may go on to discuss the probabilities of errors.
First of all: is it legitimate to discuss the probabilities of errors in testing
statistical hypotheses? Isn’t this equivalent to discussing the probabilities
of hypotheses themselves, which would be useless? E.g., it would be useless
to discuss the probability of Student’s hypothesis because this would be the
same as the probability of $\mu = 0$. As $\mu$ is an unknown constant, the probability
of $\mu$ being equal to zero must be either $P\{\mu = 0\} = 0$ or $P\{\mu = 0\} = 1$
and, without obtaining precise information as to whether $\mu$ is equal to zero
or not, it would be impossible to decide what is the value of $P\{\mu = 0\}$.

To this criticism the answer is the following. Undoubtedly, $\mu$ is an
unknown constant and, as far as we deal with the theory of probability as
described in my first two lectures, it is useless to consider $P\{\mu = 0\}$. On
the other hand our verdict concerning the hypothesis tested, $H_0$, depends
on the position of the sample point $E$, that is to say, on its coordinates, and
these, according to our assumptions, are random variables. It follows that
our verdict is random and that there is no inconsistency in considering the
probability of the verdict having this or that property, for example, of its
being erroneous.

Consider the sample point $E$ and any region $w$ in the sample space.
The probability of $E$ falling within $w$ may depend on the hypothesis that
happens to be true. For example, if formula (4) represents the probability
law of the $x_i$, and $\mu = 0$, then the probability of $E$ falling within some
particular region $w$ may be 1/2. On the other hand if $\mu = 10$, say, the
same probability may be equal to 0.0001. Therefore we shall agree to
denote by $P\{E \in w \mid H\}$ the probability of $E$ falling within $w$ calculated on
the assumption that the hypothesis $H$ is true.

Now consider a hypothesis $H_0$ which we desire to test, and any region $w$
which we have chosen to serve as critical region. What are the circumstances
in which we commit an error of the first kind? They are: (i) the
hypothesis tested is true; and (ii) the sample point $E$ falls within the
critical region $w$, whereupon $H_0$ is unjustly rejected. It follows that the
probability of an error of the first kind must be calculated on the assumption
that $H_0$ is true and, in fact, it is the probability

$$P\{E \in w \mid H_0\} \tag{62}$$

of $E$ falling within $w$.

Now let us turn to errors of the second kind. For an error of the second
kind to be committed it is necessary (and sufficient) that the hypothesis
tested $H_0$ be wrong and that the sample point fail to fall within the critical
region selected. But if $H_0$ is wrong, then some other admissible hypothesis
$H'$ must be true. Therefore, the probability of an error of the second kind is

$$1 - P\{E \in w \mid H'\}. \tag{63}$$

Obviously, instead of considering the probability of committing an error
of the second kind, we may consider the probability of avoiding it, which
is denoted by $\beta(w \mid H')$, so that

$$\beta(w \mid H') = P\{E \in w \mid H'\}. \tag{64}$$

$\beta(w \mid H')$ considered as a function of $H'$ is described as the power (the
power of detecting the falsehood of the hypothesis tested) of the region $w$
with respect to the alternative hypothesis $H'$.
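For a discrete sample space the three quantities (62), (63) and (64) are plain sums over the points of the critical region. The sketch below illustrates this on a hypothetical case that is not taken from the lecture: a single observation $x$, the number of successes in 10 trials, with $H_0$ asserting a success probability of 0.5, one alternative $H'$ asserting 0.8, and the region "reject when $x \ge 8$" chosen arbitrarily.

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities of 0, 1, ..., n successes in n independent trials."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def prob_in_region(region, pmf):
    """P{E in w | H}: total probability, under H, of the points of w."""
    return sum(pmf[e] for e in region)


n = 10
pmf_h0 = binomial_pmf(n, 0.5)    # hypothesis tested (assumed for illustration)
pmf_alt = binomial_pmf(n, 0.8)   # one alternative hypothesis (assumed)
w = {8, 9, 10}                   # critical region: reject when x >= 8

first_kind = prob_in_region(w, pmf_h0)   # formula (62)
power = prob_in_region(w, pmf_alt)       # formula (64)
second_kind = 1 - power                  # formula (63)

print(f"error of the first kind:  {first_kind:.4f}")
print(f"error of the second kind: {second_kind:.4f}")
print(f"power:                    {power:.4f}")
```

Evaluating `prob_in_region` for the same region under a whole family of alternatives gives the power as a function of $H'$.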

Any rational choice of a test must be made with regard to the properties
of the power (64). Indeed, the values of the power $\beta(w \mid H)$ for a fixed
region $w$ and for a changing hypothesis $H$ (which in particular may be
$H_0$, the one we desire to test) give no more and no less than a complete
description of the properties of the test based on the critical region $w$. In
fact, what could be called “the properties of a test”? To know the properties
of a test can mean nothing but to know (i) how frequently this test
will reject the hypothesis $H_0$ tested, when it is true; and (ii) how frequently
it will disprove $H_0$ when $H_0$ is wrong. That is exactly what the values
of the function $\beta(w \mid H)$ tell us. Without knowing the properties of
$\beta(w \mid H)$, we cannot very well say that we know the properties of a test
based on $w$. And just these properties of the power seem to be the proper
rational basis for choosing a test.

For example, by considering the power of Student’s test, it is possible to
show that this test has the following properties, which put it above any
other test that may be suggested.

1. The probability of rejecting the hypothesis $H_0$ that $\mu = 0$ is always
greater when the hypothesis $H_0$ is wrong than in cases when $H_0$ is true.
This property is described by the adjective “unbiased” attached to the
test possessing the property.

2. Any other unbiased test, if it leads to the same frequency of errors of
the first kind, will less frequently detect the falsehood of the hypothesis
$H_0$ when $H_0$ is in fact wrong.

The responsibility for the above concepts and for the resulting theory
of testing statistical hypotheses is borne jointly by Egon S. Pearson and
the present writer. Our first paper² on the subject was published in 1928,
over twenty years ago. However, it took another five years for the basic
idea of a rational theory to become clear in our minds.³ Thereafter, the
work became easier and within a short time we were joined by a number
of colleagues.⁴

Bare statements of principles are never clear unless the principles are
illustrated in full detail with examples. It would be most satisfactory if
the use of the concepts described above could be illustrated with examples
which are both easy and of practical importance. Unfortunately, it is very
difficult to satisfy both conditions at the same time. One must choose
between the illustrativeness of an example which involves a certain artificiality
and the practical importance of a test which involves technical
difficulties in dealing with the problem. Faced with the necessity of choosing
between the two alternatives, the writer felt that the readers of this book
would be best served by a simple illustrative example, even though it is
somewhat artificial.

² J. Neyman and E. S. Pearson: “On the use and interpretation of certain test criteria
for purposes of statistical inference.” Biometrika, Vol. 20-A (1928), pp. 175-240 and
264-299.

³ J. Neyman and E. S. Pearson: “On the problem of the most efficient tests of statistical
hypotheses.” Phil. Trans. Roy. Soc., London, Vol. 231A (1933), pp. 289-337. Recently,
a systematic elementary presentation of the theory was given in the author’s
First Course on Probability and Statistics already quoted.

⁴ See: Statistical Research Memoirs, Vol. I (1936), Vol. II (1938).

We will imagine an early stage in the study of a pair of genes, the dominant
gene to be called G, the recessive g. We imagine that it is more or
less taken for granted that the mating of the organisms carrying these
genes is non-assortative (i.e., that the genetical composition of one mate is
independent of that of the other mate) and is of uniform fertility. Contrary
to this general belief, a geneticist suspects that the recessive types
gg do not participate in the reproduction. This suspicion is not based on
any trials but on some analogies, and, in preparing for a meeting at which
the genes G and g are to be discussed, the geneticist is somewhat hesitant
whether or not to come out with his doubts. Before deciding, he wishes to
take into account the results of two independent experiments performed for
other purposes, but involving genes G and g. Both experiments had the
same pattern. In each case two hybrids Gg × Gg were crossed, giving a
generation of progeny which we shall denote by $F_1$. Next the $F_1$ individuals
were allowed to mate without interference, producing the second
generation $F_2$. Finally the $F_2$ individuals were allowed to mate without
interference and they produced the third generation $F_3$. Since the two
experiments were carried out for purposes not connected with genes G and
g, the records of the experiments appear to be fragmentary as far as the
genes G and g are concerned. In fact, the only information concerning
these genes in the first experiment is that the $F_2$ generation was composed
of $n_1 = 8$ individuals and that among them there were exactly $x_1$ recessives
gg. Further, the records of the second experiment show only that the $F_3$
generation was composed of $n_2 = 10$ individuals and that among them
there were exactly $x_2$ recessives gg. The values of the four numbers $n_1$, $n_2$
and $x_1$, $x_2$ must now be used by the geneticist to make up his mind whether
or not to voice doubts about the non-assortative character of mating. Every
human action is subject to error, and therefore the geneticist would not
mind being in error from time to time. However, he is inclined to lay down
rules for his behavior so as to control the frequency of errors. First, in
cases where some established hypotheses are true, he would like to voice
doubts of these hypotheses only rarely, say with a frequency not exceeding
a selected number $\alpha$, perhaps $\alpha = .1$ or $\alpha = .05$ or the like. Another requirement
which the geneticist lays down for his behavior is that, in cases where
some hypothesis $H_2$, alternative to the established hypothesis $H_1$, is true,
then he wants his rule to lead him to protest as frequently as is humanly
possible.

Applying these two principles to the case of the genes G and g, the
geneticist notices that $n_1$ and $n_2$ are sure numbers while $x_1$ and $x_2$ are random
variables whose particular values are determined by the two experiments.
Let $H_1$ and $H_2$ denote the two hypotheses under consideration.
Namely, $H_1$ asserts that, with respect to genes G and g, the mating is non-assortative
with uniform fertility and $H_2$ asserts that the non-assortativeness
and uniform fertility apply only to dominant and hybrid types GG
and Gg, but that the recessives gg do not participate in the reproduction.
We will assume for simplicity that the geneticist admits the possibility of
only these two hypotheses $H_1$ and $H_2$.
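The chance that a given individual of a given generation is a recessive gg can be traced under each hypothesis by following the frequency of the gene g from one generation to the next. The sketch below does so under the usual random-mating assumption (an offspring is gg with probability $q^2$ when the gene g has frequency $q$ among the parents). The function name and the recursion are an illustration, not the lecturer's own derivation.

```python
from fractions import Fraction


def recessive_fractions(generations, recessives_reproduce):
    """Expected fraction of gg individuals in F1, F2, ... after a Gg x Gg cross."""
    q = Fraction(1, 2)  # frequency of gene g among the two hybrid parents
    fractions = []
    for _ in range(generations):
        fractions.append(q * q)  # random mating: offspring is gg with chance q^2
        if not recessives_reproduce:
            # Only GG and Gg take part in reproduction, so among the parents
            # of the next generation the frequency of g drops to q / (1 + q).
            q = q / (1 + q)
    return fractions


under_h1 = recessive_fractions(3, recessives_reproduce=True)
under_h2 = recessive_fractions(3, recessives_reproduce=False)
print("H1:", [str(f) for f in under_h1])
print("H2:", [str(f) for f in under_h2])
```

Under $H_1$ the fraction stays at 1/4 in every generation, while under $H_2$ it falls to 1/9 in $F_2$ and 1/16 in $F_3$. These values agree with the constants $.42597 = \log_{10}(8/3)$ and $.69897 = \log_{10} 5$ used later in the lecture.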

On either hypothesis, the random variables $x_1$ and $x_2$ are capable of
assuming all the 99 different combinations of integer values $x_1 = k_1$ and
$x_2 = k_2$, with $k_1 = 0, 1, 2, \cdots, 8$ and $k_2 = 0, 1, 2, \cdots, 10$. Thus the
sample space $W$ is composed of 99 points with coordinates $(k_1, k_2)$. Easy
calculations give the probability that the sample point $E = (x_1, x_2)$ will
assume the position $(k_1, k_2)$. Namely, on the hypothesis $H_1$ we have, say,

$$p(k_1, k_2 \mid H_1) = P\{(x_1 = k_1)(x_2 = k_2) \mid H_1\} = \binom{n_1}{k_1}\binom{n_2}{k_2}\,\frac{3^{\,n_1+n_2-k_1-k_2}}{4^{\,n_1+n_2}}. \tag{65}$$

On the hypothesis $H_2$ we have

$$p(k_1, k_2 \mid H_2) = P\{(x_1 = k_1)(x_2 = k_2) \mid H_2\} = \binom{n_1}{k_1}\binom{n_2}{k_2}\,\frac{8^{\,n_1-k_1}\,15^{\,n_2-k_2}}{9^{\,n_1}\,16^{\,n_2}}. \tag{66}$$
Tables I and II give the numerical values of these probabilities for all
combinations of $k_1 = 0, 1, 2, \cdots, 8$ and $k_2 = 0, 1, 2, \cdots, 10$ in so far as
these probabilities are not too small. Upon adding all the entries in
Table I the reader will obtain the total .998. Thus the probability is
approximately .002 that the sample point $E$ will occupy any position in
the sample space for which the entry in Table I is zero or is not listed at
all. The same probability for Table II is equal to .004.

TABLE I

Joint probability distribution of $x_1$ and $x_2$, $P\{(x_1 = k_1)(x_2 = k_2) \mid H_1\}$, as determined by the
hypothesis $H_1$

[Table entries not recoverable.]

TABLE II

Joint probability distribution of $x_1$ and $x_2$, $P\{(x_1 = k_1)(x_2 = k_2) \mid H_2\}$, as determined by $H_2$

                          k_1
 k_2      0      1      2      3      4      5

  0     .204   .204   .089   .022   .003   .000
  1     .136   .136   .060   .015   .002   .000
  2     .041   .041   .018   .004   .001   .000
  3     .007   .007   .003   .001   .000   .000
  4     .001   .001   .000   .000   .000   .000
  5     .000   .000   .000   .000   .000   .000
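Both tables can be regenerated from the two binomial laws involved. The sketch below assumes that the chance of a recessive is 1/4 in both experiments under $H_1$, and 1/9 in the first and 1/16 in the second under $H_2$; the function names are illustrative only.

```python
from math import comb

N1, N2 = 8, 10  # sizes of the two recorded generations


def binomial(n, k, p):
    """Probability of exactly k recessives among n individuals."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def joint_pmf(p1, p2):
    """Joint law of (x1, x2) for two independent experiments."""
    return {
        (k1, k2): binomial(N1, k1, p1) * binomial(N2, k2, p2)
        for k1 in range(N1 + 1)
        for k2 in range(N2 + 1)
    }


pmf_h1 = joint_pmf(1 / 4, 1 / 4)    # assumed gg-probabilities under H1
pmf_h2 = joint_pmf(1 / 9, 1 / 16)   # assumed gg-probabilities under H2

# Corner of the table under H2: rows k2 = 0..5, columns k1 = 0..5.
for k2 in range(6):
    print(k2, " ".join(f"{pmf_h2[(k1, k2)]:.3f}" for k1 in range(6)))
```

Replacing `pmf_h2` by `pmf_h1` in the final loop prints the corresponding corner of the table under $H_1$.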

Consider now the problem of selecting the combinations of values of
$x_1$ and $x_2$ such that, if any one of these combinations is determined by the
two experiments, then the geneticist would consider it advisable to reject
the hypothesis $H_1$. In the terminology of this lecture, the problem is
that of selecting the critical region $w_0$ for testing the hypothesis $H_1$ against
the set $\Omega$ of admissible hypotheses which, in this case, includes $H_1$ and $H_2$
only. The principles which the geneticist laid down for his choice are
exactly those determining the best critical region for testing $H_1$ against $\Omega$.
The first of these principles is that the region $w_0$ be one of those regions $w$
for which

$$P\{E \in w \mid H_1\} \le \alpha. \tag{67}$$

The second principle is that, if $w_0$ is the selected region and $w$ any other
region such that

$$P\{E \in w \mid H_1\} \le P\{E \in w_0 \mid H_1\}$$

then

$$P\{E \in w \mid H_2\} \le P\{E \in w_0 \mid H_2\}.$$


The construction of the critical region $w_0$ having this property is easily
accomplished by the following simple rule, the validity of which will be
proved in general, for any number of discrete observable random variables
$x_1, x_2, \cdots, x_n$.

Denote generally by $e_1, e_2, \cdots, e_n, \cdots$ all possible positions of the
sample point $E$ as may be determined by some observations. Let further
$p(e_i \mid H_1)$ and $p(e_i \mid H_2)$ denote the probabilities determined by the
hypotheses $H_1$ and $H_2$, respectively, that $E$ will coincide with $e_i$. Here
some of the probabilities $p(e_i \mid H_j)$, $j = 1, 2$, may be zero while others are
positive. For each point $e_i$ for which $p(e_i \mid H_2) > 0$ define the ratio

$$R(e_i) = \frac{p(e_i \mid H_1)}{p(e_i \mid H_2)}.$$

Lemma. IF $a$ IS A POSITIVE NUMBER AND $w_0$ A REGION IN THE SAMPLE SPACE
SUCH THAT IT INCLUDES ALL POINTS $e_i$ FOR WHICH $R(e_i) < a$ AND NONE OF
THOSE POINTS $e_m$ FOR WHICH $R(e_m) > a$, THEN, WHATEVER BE ANY OTHER
REGION $w$ SUCH THAT

$$P\{E \in w \mid H_1\} \le P\{E \in w_0 \mid H_1\}, \tag{68}$$

NECESSARILY

$$P\{E \in w \mid H_2\} \le P\{E \in w_0 \mid H_2\}.$$

If the regions $w_0$ and $w$ are contemplated as critical regions for testing
$H_1$, then $P\{E \in w \mid H_1\}$ is the probability that $H_1$ will be rejected using $w$
in those cases when the true hypothesis is $H_1$. Thus $P\{E \in w \mid H_1\}$ is the
probability of an erroneous rejection of $H_1$ (that is, rejection when $H_1$ is
true, or the probability of an error of the first kind). On the other hand,
$P\{E \in w \mid H_2\}$ is the probability of rejecting $H_1$ when the true hypothesis
is $H_2$, i.e., it is the power of the test based on $w$. The property of $w_0$
asserted by the Lemma may be described verbally by stating that out of all
critical regions $w$ which control the errors of the first kind as well as $w_0$
or better, the critical region $w_0$ has the greatest power.
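On a small sample space the assertion of the Lemma can be checked by direct enumeration of all regions. The sketch below uses a hypothetical single experiment with 8 individuals, the chance of a recessive being taken as 1/4 under $H_1$ and 1/9 under $H_2$, and exact rational arithmetic; it illustrates the statement and is no substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def binomial_pmf(n, p):
    """Exact binomial probabilities of 0, 1, ..., n as fractions."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n = 8
p_h1 = binomial_pmf(n, Fraction(1, 4))   # assumed law under H1
p_h2 = binomial_pmf(n, Fraction(1, 9))   # assumed law under H2
points = range(n + 1)

ratio = [p_h1[k] / p_h2[k] for k in points]   # R(e) for every point
a = ratio[1]                                  # threshold chosen for illustration
w0 = {k for k in points if ratio[k] <= a}     # all points with R < a, none with R > a


def prob(region, pmf):
    return sum((pmf[k] for k in region), Fraction(0))


size0, power0 = prob(w0, p_h1), prob(w0, p_h2)

# Search all 2^9 regions for one that controls errors of the first kind
# at least as well as w0 and yet has greater power.
violations = [
    w
    for r in range(len(points) + 1)
    for w in combinations(points, r)
    if prob(w, p_h1) <= size0 and prob(w, p_h2) > power0
]
print("w0 =", sorted(w0), " regions contradicting the Lemma:", len(violations))
```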

In proving the Lemma it will be convenient to use the following notation.
Let $u$ be some region in the sample space and let

$$e_{k_1}, e_{k_2}, \cdots, e_{k_m}, \cdots$$

be all the possible positions of the sample point $E$ which fall within the
region $u$. Then the probability $P\{E \in u \mid H_i\}$ that the sample point will
fall within $u$ is given by the sum

$$P\{E \in u \mid H_i\} = \sum_{j \ge 1} p(e_{k_j} \mid H_i).$$

It will be convenient to denote this last sum simply by $\sum_{u} p(e \mid H_i)$.
With this notation, the inequality (68) can be written as

$$\sum_{w_0} p(e \mid H_1) \ge \sum_{w} p(e \mid H_1)$$

and it follows that, say,

$$\Delta(H_1) = \sum_{w_0} p(e \mid H_1) - \sum_{w} p(e \mid H_1) \ge 0. \tag{69}$$

The two regions $w_0$ and $w$ may have a common part which we will denote
by $v$. Should there be no common part of $w_0$ and $w$, then $v$ will stand for
the “empty” set of points. In any case we may write that

$$\begin{aligned} w_0 &= (w_0 - v) + v, \\ w &= (w - v) + v, \end{aligned} \tag{70}$$

and it is clear that every point in $w - v$ lies outside of $w_0$.
Obviously $\Delta(H_1)$ can now be rewritten using the summation over the regions
$w_0 - v$ and $w - v$,

$$\Delta(H_1) = \sum_{w_0 - v} p(e \mid H_1) - \sum_{w - v} p(e \mid H_1) \ge 0. \tag{71}$$
Wo—-v W—v 
The region $w_0 - v$ contains only points $e$ which are interior to $w_0$. Because
of the definition of $w_0$, for each of these points

$$p(e \mid H_1) = R(e)\,p(e \mid H_2) \le a\,p(e \mid H_2). \tag{72}$$

Therefore, say

$$\Delta' = a \sum_{w_0 - v} p(e \mid H_2) - \sum_{w - v} p(e \mid H_1) \ge \Delta(H_1) \ge 0. \tag{73}$$

Since each point $e$ belonging to $w - v$ lies outside of $w_0$, the definition of $w_0$
implies that for each such point

$$p(e \mid H_1) = R(e)\,p(e \mid H_2) \ge a\,p(e \mid H_2). \tag{74}$$

Therefore

$$a\,\Delta(H_2) = a \sum_{w_0 - v} p(e \mid H_2) - a \sum_{w - v} p(e \mid H_2) \ge \Delta' \ge \Delta(H_1) \ge 0. \tag{75}$$

Since $a$ is a positive number, it follows that

$$\sum_{w_0 - v} p(e \mid H_2) \ge \sum_{w - v} p(e \mid H_2). \tag{76}$$

Adding to both sides of this inequality the same sum $\sum_{v} p(e \mid H_2)$, we obtain
the desired result, namely,

$$P\{E \in w_0 \mid H_2\} = \sum_{w_0} p(e \mid H_2) \ge \sum_{w} p(e \mid H_2) = P\{E \in w \mid H_2\}. \tag{77}$$

This completes the proof of the Lemma.

It follows from the Lemma that the operations necessary for determining a
best critical region for testing $H_1$ with respect to a single alternative hypothesis
$H_2$ are the following.

(i) Compute the ratio $R(e)$ for all possible sample points.

(ii) Renumber the possible sample points in order of the magnitude of
the corresponding ratios $R(e)$, beginning with the smallest $R(e_1)$, so that

$$R(e_1) \le R(e_2) \le \cdots \le R(e_k) \le R(e_{k+1}) \le \cdots. \tag{78}$$

(iii) Include $e_1$ in the critical region $w_0$ and also as many of the following
points, $e_2, e_3, \cdots, e_m$ as possible without impinging upon the condition that
the probability determined by $H_1$ of the sample point $E$ falling within $w_0$
does not exceed $\alpha$,

$$P\{E \in w_0 \mid H_1\} = \sum_{i=1}^{m} p(e_i \mid H_1) \le \alpha. \tag{79}$$
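The three steps translate almost literally into a short routine. The sketch below is a minimal illustration on a made-up sample space of four points with made-up probabilities; the function name `best_critical_region` and the data are hypothetical. Points at which the alternative assigns zero probability are left out, since the ratio $R(e)$ is defined only where $p(e \mid H_2) > 0$.

```python
from fractions import Fraction


def best_critical_region(p_tested, p_alt, alpha):
    """Return (region, size) built by steps (i) to (iii)."""
    # Step (i): the ratio R(e) for every point where it is defined.
    ratios = {e: p_tested[e] / p_alt[e] for e in p_tested if p_alt[e] > 0}
    # Step (ii): renumber the points in order of increasing R(e).
    ordered = sorted(ratios, key=ratios.get)
    # Step (iii): take e_1, e_2, ... while the probability under the
    # tested hypothesis does not exceed alpha.
    region, size = [], 0
    for e in ordered:
        if size + p_tested[e] > alpha:
            break
        region.append(e)
        size += p_tested[e]
    return region, size


F = Fraction
p_h1 = {"a": F(1, 20), "b": F(1, 20), "c": F(3, 10), "d": F(3, 5)}  # made-up law under H1
p_h2 = {"a": F(2, 5), "b": F(1, 5), "c": F(3, 10), "d": F(1, 10)}   # made-up law under H2

region, size = best_critical_region(p_h1, p_h2, alpha=F(1, 10))
power = sum(p_h2[e] for e in region)
print(region, size, power)
```

Exact fractions are used so that the comparison with $\alpha$ in step (iii) is not disturbed by rounding when the cumulative probability equals $\alpha$ exactly.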


Returning to the problem of testing the hypothesis $H_1$ concerned with
non-assortative mating and uniform fertility, we could proceed in two
slightly different ways. One of these consists in computing the ratios
$R(e)$ numerically as indicated in step (i). The disadvantage of this method
is that it is somewhat cumbersome and involves ratios of numbers which
are so small that they are not recorded in Tables I and II.

The other method is to compute the formula for $R(e)$. We have, say

$$R(e) = R(k_1, k_2) = \frac{p(k_1, k_2 \mid H_1)}{p(k_1, k_2 \mid H_2)} = \frac{3^{\,n_1+n_2-k_1-k_2}\,9^{\,n_1}\,16^{\,n_2}}{4^{\,n_1+n_2}\,8^{\,n_1-k_1}\,15^{\,n_2-k_2}} = C\left(\frac{8}{3}\right)^{k_1} 5^{k_2} = C R'(k_1, k_2), \tag{80}$$

where, for the sake of brevity, the letter $C$ is used to denote the numerical
factor

$$C = \frac{3^{\,n_1+n_2}\,9^{\,n_1}\,16^{\,n_2}}{4^{\,n_1+n_2}\,8^{\,n_1}\,15^{\,n_2}} \tag{81}$$

which is independent of $k_1$ and $k_2$. It is obvious that instead of ordering
the points $e$ in the order of magnitude of $R(e)$, we may order them in the
order of magnitude of $R'(k_1, k_2)$ or, since this is even more convenient, in
the order of magnitude of, say

$$r(k_1, k_2) = \log_{10} R'(k_1, k_2) = k_1 \log_{10}\left(\tfrac{8}{3}\right) + k_2 \log_{10} 5 = k_1(.42597) + k_2(.69897). \tag{82}$$

Now it is obvious that the first point to be included in $w_0$ is the one corresponding
to $k_1 = k_2 = 0$. The next most desirable point is $k_1 = 1$, $k_2 = 0$,
etc. Table III gives the ordering of points $(k_1, k_2)$ as indicated in step
(ii), the corresponding values of $r(k_1, k_2)$, the corresponding probabilities
determined by $H_1$ and $H_2$ and the cumulative sums of these probabilities.
The most interesting columns in Table III are columns (5) and (7).
Column (5) gives the probabilities determined by $H_1$ that the point $E$ to
be determined by observing $x_1$ and $x_2$ will fall within the critical region
$w_0$ including only the point $e_1$, or the two points $e_1$ and $e_2$, or three points
$e_1, e_2, e_3$, etc. These probabilities, then, are the probabilities of wrongly
rejecting the hypothesis $H_1$ when it is in fact true, corresponding to an
increasing critical region $w_0$. For example, if the geneticist decides that
he should not raise false doubts concerning hypotheses more often than
five times in a hundred when such hypotheses are true, then his critical
region should include only three points (0, 0), (1, 0) and (0, 1) with the
resulting probability of an error of the first kind equal to .040. Should
this be his decision, then the probability of detecting that $H_1$ is false when
the true hypothesis is $H_2$ (or the power of the test), is .544. It is found in
column (7) of Table III.

TABLE III

Steps (ii) and (iii) in determining $w_0$

Columns: (1) sample point $e_k$; (2) $(k_1, k_2)$; (3) $r(k_1, k_2)$;
(4) $p(e_k \mid H_1) = p(k_1, k_2 \mid H_1)$; (5) $\sum_{i=1}^{k} p(e_i \mid H_1)$;
(6) $p(e_k \mid H_2) = p(k_1, k_2 \mid H_2)$; (7) $\sum_{i=1}^{k} p(e_i \mid H_2)$.

 (1)     (2)       (3)       (4)     (5)     (6)     (7)

 e_1     0, 0     .00000    .006    .006    .204    .204
 e_2     1, 0     .42597    .015    .021    .204    .408
 e_3     0, 1     .69897    .019    .040    .136    .544
 e_4     2, 0     .85194    .018    .058    .089    .633
 e_5     1, 1    1.12494    .050    .108    .136    .769

However, the geneticist may compromise on the probability of the error
of the first kind equal to .058, or even .108. Then his chances of detecting
the falsehood of $H_1$ when the true hypothesis is $H_2$ will be .633 or .769,
respectively.
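The ordering of the sample points and the two cumulative columns of Table III can be recomputed as follows. The sketch assumes the chance of a recessive to be 1/4 in both experiments under $H_1$, and 1/9 and 1/16 under $H_2$; the names are illustrative. Computed from unrounded probabilities, the cumulative sums come out at about .0395, .0570, .1071 under $H_1$ and .5451, .6345, .7708 under $H_2$. The figures printed in columns (5) and (7) differ from these in the third decimal, apparently because they were obtained by adding the rounded entries of columns (4) and (6). The choice of the critical region at $\alpha = .05$ is not affected.

```python
from math import comb, log10

N1, N2 = 8, 10
P_H1 = (1 / 4, 1 / 4)    # assumed gg-probabilities in the two experiments under H1
P_H2 = (1 / 9, 1 / 16)   # assumed gg-probabilities under H2


def binomial(n, k, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob(point, p_pair):
    """Probability of the sample point (k1, k2) under one hypothesis."""
    k1, k2 = point
    return binomial(N1, k1, p_pair[0]) * binomial(N2, k2, p_pair[1])


def r(point):
    """Formula (82): log10 of R'(k1, k2) = (8/3)^k1 * 5^k2."""
    k1, k2 = point
    return k1 * log10(8 / 3) + k2 * log10(5)


# Step (ii): all 99 points in order of increasing r(k1, k2).
points = sorted(((k1, k2) for k1 in range(N1 + 1) for k2 in range(N2 + 1)), key=r)

rows, size, power = [], 0.0, 0.0
for point in points[:5]:
    size += prob(point, P_H1)     # column (5), from unrounded probabilities
    power += prob(point, P_H2)    # column (7), from unrounded probabilities
    rows.append((point, r(point), size, power))
    print(point, f"{r(point):.5f}  {size:.4f}  {power:.4f}")
```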

Whichever critical region is finally adopted, including any number of the
first points $e_i$ ordered according to the value of $r(k_1, k_2)$, the Lemma guarantees
that the power of the resulting test cannot be improved by using any
other critical region which controls the errors of the first kind to the same
(or better) level as the region chosen.

Suppose now, that the values of $x_1$ and $x_2$ that were actually observed
are $k_1 = 2$, $k_2 = 0$. It follows from the foregoing that, if the geneticist
does not insist on the probability of an error of the first kind being less
than .058, he should go ahead and voice his doubts of the hypothesis $H_1$
of non-assortativeness of mating and of uniform fertility. In taking this
step he should be aware that the above analysis does not contribute anything
about the falsehood or correctness of the particular genetical hypothesis
$H_1$. In fact, no test can reveal any definite information about any
statistical hypothesis if the values of the observable random variables
which are possible under this hypothesis are also possible under some
alternative one. All the geneticist can be certain about is that, if his
attitudes towards statistical hypotheses are consistently governed by analyses
such as the one described, with a fixed value of $\alpha$, then, in the long
run, the relative frequency of his raising doubts concerning hypotheses,
when such doubts are unjustified, will not exceed $\alpha$. Moreover, he can
also be sure that, in cases when the hypothesis tested $H_1$ is wrong, the
chance of the above method detecting the falsehood of $H_1$ is as good as or
better than that corresponding to any other method insuring the same level
of control of errors of the first kind.

The reader may be interested in considering critical regions for testing
$H_1$ against $H_2$ other than the ones suggested in Table III. For example,
the reader may wish to compute the probability of error of the first kind
and the power of critical regions whose selection is based on the probability
distribution of $x_1$ and $x_2$ determined by $H_1$. Upon examining Table III
one might perhaps suggest the critical region $w'$ composed of all possible
sample points $e$ for which

$$p(e \mid H_1) < .001 \tag{83}$$

or, perhaps the critical region $w''$ composed of all points such that

$$p(e \mid H_1) < .005, \tag{84}$$

etc. It will be seen that regions of this kind will control errors of the first
kind to levels comparable to those of regions $w_0$, suggested in Table III.
However, there will be a marked difference between the two kinds of tests
in their power to detect the falsehood of $H_1$ when the true hypothesis
is $H_2$.
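The comparison suggested to the reader can be carried out with a few lines of code. The sketch below again assumes the chance of a recessive to be 1/4 in both experiments under $H_1$, and 1/9 and 1/16 under $H_2$. It evaluates the regions defined by (83) and (84) and, for reference, the three-point region (0, 0), (1, 0), (0, 1).

```python
from math import comb

N1, N2 = 8, 10
P_H1 = (1 / 4, 1 / 4)    # assumed gg-probabilities under H1
P_H2 = (1 / 9, 1 / 16)   # assumed gg-probabilities under H2


def binomial(n, k, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob(point, p_pair):
    k1, k2 = point
    return binomial(N1, k1, p_pair[0]) * binomial(N2, k2, p_pair[1])


points = [(k1, k2) for k1 in range(N1 + 1) for k2 in range(N2 + 1)]


def region_below(threshold):
    """All sample points whose probability under H1 is below the threshold."""
    return [e for e in points if prob(e, P_H1) < threshold]


def size_and_power(region):
    size = sum(prob(e, P_H1) for e in region)    # error of the first kind
    power = sum(prob(e, P_H2) for e in region)   # chance of rejecting H1 under H2
    return size, power


w0 = [(0, 0), (1, 0), (0, 1)]
for name, region in [("w0", w0), ("w'", region_below(0.001)), ("w''", region_below(0.005))]:
    size, power = size_and_power(region)
    print(f"{name:4s} size {size:.4f}  power {power:.4f}")
```

Regions of the kind (83) and (84) consist mostly of points with many recessives, which are even less probable under $H_2$ than under $H_1$; this is why their power stays so far below that of $w_0$.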








CHAPTER II 
Some Controversial Matters Relating to Agricultural Trials 


Part 1. Randomized and Systematic Arrangements of Field Experiments 


(The contents of this lecture are based on a conference at the Cosmos Club, Washington, 
D. C., held April 7, 1937, under the chairmanship of Dr. Frederick F. Stephan and also 
on some sections of papers published in the Supplement to the Journal of the Royal 
Statistical Society, Vol. 2, 1935.) 


I am going to speak on a very controversial question: Can systematically
arranged agricultural trials be treated with any success by means of mathematical
statistics? Two eminent statisticians who are also experts in agricultural
experimentation disagree drastically on the answer and each of
them has a number of supporters. One of these scientists, Professor R. A.
Fisher, claims that, in arranging field experiments systematically, we are
bound to obtain all sorts of biases in our estimates and thus to ruin the
statistical tests. The other scientist is “Student” who can be considered,
and rightly so, the father of statistical work in agricultural experimentation.
He does not deny that the formulas usually applied to estimate the experimental
standard error in both randomized and systematic trials are in the
latter case somewhat biased and tend to overestimate the error. But it is
his claim that the actual accuracy of a systematic experiment is usually
greater than that of a randomized one. In his opinion, too high an estimate
of the standard error is not especially important, since it keeps the experimenter
on the safe side.


Members of the present audience who are familiar with the material 
of my first two lectures are aware that the answer to the question must 
be both empirical and subjective. Since the application of formulas of 
mathematical statistics to the results of agricultural trials presumes the 
existence of a mathematical model of these experiments, the question under 
consideration reduces to one of whether or not the correspondence between 
the model and what happens in actual practice is sufficiently accurate. 
This question is exactly similar to the one mentioned in my second lecture 
(page 23): “Can the formulas of plane geometry be applied to measure 
this or that area on the surface of the earth?” Another similar problem 
(page 28) is whether or not formulas deduced from the Poisson law of
frequency can be successfully used to estimate the probability that a colony
on a Petri plate is produced by a single individual.

The empirical character of the answer arises from the fact that the 
answer involves trials in conditions of actual practice. The subjective 
character is unavoidable, because, after we have the results of the trials 
and also the corresponding theoretical deductions from their mathematical 
model, we must judge whether the agreement is or is not satisfactory. One 
of the ways by which the insufficiency of plane geometry may be revealed 
consists in subdividing an area of the type it is desired to measure into 
several suitable partial ones and in measuring each of the parts. If the 
measure of the whole appears to be very different from the sum of the 
measures of its parts, then we would say that the assumption that the area 
measured is plane is too crude. But it will be up to us to decide whether 
the disagreement between the two measures is actually large or not, and 
in this respect personal opinions vary. 

Having this in view, I am going to give a short account of the work
recently done by Mr. C. Chandra Sekar in the Department of Statistics,
University College, London. This provides the objective empirical part
of the answer to the question discussed by Fisher and Student. The results
that I shall describe are of the same character as those contained in my
second lecture (pp. 30-41): on the one hand you will see figures representing
frequencies of various results, as predicted from the mathematical
models of the agricultural trials, and on the other, the frequencies actually
observed. If the agreement between the two is judged satisfactory, the
conclusion will be that there is no special harm in arranging the experiments
systematically. If, on the other hand, you find that the agreement
is bad, you will require an alteration either of the mathematical model or
of the experimental design. For example, you may decide to randomize
your trials.

Now I must enter into details and describe the experiments that I have 
in mind. I shall deal with experiments of a very common type in which 
the plots are rather narrow, long rectangles all arranged in one row. They 
are combined into a few blocks and within each block all the compared 
agricultural objects (varieties or treatments) are distributed in one way 
or another. This is the general description. If we add to this some details 
on the way the objects are distributed within the blocks, we shall obtain 
the full description of the two types of arrangements under discussion. 

One of these is the so-called arrangement in randomized blocks. In this
arrangement, as you know, each of the objects is repeated in each of the
blocks the same number of times, e.g. once, and the order in which the
objects occur within each block is determined by random sampling. If the
number of compared objects is four and they are denoted by A, B, C, D,
then in a randomized block experiment we may find the following distribution
of objects on the successive plots.

Block I     Block II     Block III     Block IV

[Diagram: the four objects A, B, C, D arranged in a separately randomized order within each of Blocks I to IV.]
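A layout of this kind is easily produced by drawing, independently for each block, a random order of the objects. The sketch below is only an illustration: the function name and the seed are arbitrary, and the seed is fixed solely so that the printed layout can be reproduced.

```python
import random


def randomized_blocks(objects, n_blocks, rng):
    """Each block contains every object once, in an independently drawn order."""
    return [rng.sample(objects, k=len(objects)) for _ in range(n_blocks)]


rng = random.Random(1937)  # arbitrary seed, used only for reproducibility
layout = randomized_blocks(["A", "B", "C", "D"], n_blocks=4, rng=rng)

for number, block in enumerate(layout, start=1):
    print(f"Block {number}:", " ".join(block))
```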





This is one type of arrangement and we know the formula by which we
can calculate the estimates of the true difference between the mean yields
which any two of the objects compared, say A and B, are able to give if
sown over the whole field. It is the difference between the means $\bar{x}_A - \bar{x}_B$
of the observed yields. Also, we know how to calculate an unbiased estimate
$s^2$ of the variance of our result. Owing to the fact that the observations
referring to one block are mutually dependent (e.g., if the object A
got the best of the four plots, then the object B must have gotten one of
the poorer plots), the further theory is not entirely clear.¹

It is probable, however, that the application of the $t$ test gives results
very much in accordance with its theory: i.e., the hypothesis tested, namely,
that there is no difference between the mean yields of the objects compared,
is rejected both when it is true and when it is false with relative frequencies
in good accord with the mathematical tables.

Many practical agriculturists find that the objects compared are not 
always satisfactorily distributed over the field if the distribution is left to 
chance. For example, they would object to the variety B being sown 
twice on adjoining plots. In their opinion, the conditions in which the 
particular objects are compared should be as equal as possible, and they 
think that this is best attained by some systematic distribution of the 
objects, such as the following. 


Block I          Block II         Block III        etc.

A  B  C  D       A  B  C  D       A  B  C  D





1 J. Neyman with cooperation of K. Iwaszkiewicz and S. Kolodziejczyk: “Statistical
problems in agricultural experimentation.” Supplement to the Jr. Roy. Stat. Soc., Vol. 2
(1935), pp. 107-180.

See also Michael D. McCarthy: “On the application of the z-test to randomized
blocks.” Annals of Math. Stat., Vol. 10 (1939), pp. 337-359.

Frequently, though not always, a field experiment arranged in the above
manner is treated statistically by means of the formulas mentioned above,
formulas meant for randomized block experiments. There is no doubt that
from the point of view of theory this procedure is wrong. The theory of 
randomized blocks assumes specifically that the blocks are randomized and 
its validity is easily shown to depend on this assumption. However, it is a 
question, not of the fact that discrepancies do arise from the disregard of 
this condition, but of the size of these discrepancies between theory and 
practice. 

The above systematic arrangement is very popular in Poland. I spent 
much time and wasted much paper trying to persuade practical experi- 
menters to randomize their blocks, but with disappointing success. Then 
the thought occurred to me that the agreement between theory and practice 
may be attained not only by altering the practice, but also by adjusting the 
theory. Consequently, I produced a paper² giving a statistical theory of
the agricultural trials arranged systematically.³

The general lines are as follows. It is assumed that the natural level
of fertility along a field may be adequately represented by a parabola of
some not very high order, say the fourth. If u denotes the coordinate of
the center of any of the plots, starting from the left, so that

$$u = 1, 2, 3, \ldots, \tag{1}$$

then the true yield of A, if it were tested on the uth plot, would be

$$A(u) = A + bu + cu^2 + du^3 + eu^4, \tag{2}$$


where A is a term depending on the object A (treatment or variety), and 
b, c, d and e are unknown coefficients. The symbol A is used here to 
signify both the thing being tested (treatment or variety), and the true 
value (as the yield) of the thing being tested. Experience has shown, how- 
ever, that confusion does not arise, and in fact the symbolism is a very 
convenient one. The true yield of the object B, if it were sown on the 
same plot would be given by 


$$B(u) = B + bu + cu^2 + du^3 + eu^4, \tag{3}$$


where B depends on the object B but the other constants b, c, d, and e are 
the same as in equation (2). Similar relations are written for C, D, etc., 
b, c, d, and e being the same for all. 

2 J. Neyman: The theoretical basis of different methods of testing cereals, Part II:
The method of parabolic curves. K. Buszczynski and Sons, Ltd., Warsaw, 1929, 48 pp.

3 In more recent times my formulae were refound by A. Hald. See A. Hald: The
decomposition of a series of observations. G. E. C. Gads Forlag, Copenhagen, 1948,
134 pp.

In actual experiments we do not obtain what we call the ‘true’ yields.
What we obtain is the sum of the true yield plus an experimental error,
due to various factors, such as inaccuracies in measuring plots, in treatment,
damage by birds, etc. My assumption was that these experimental errors 
on particular plots are independent of each other. I then applied the 
Markoff⁴ theorem to get estimates of the differences, B − A, C − A, etc.,
and of their respective variances. 
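
The estimation step can be sketched numerically. What follows is only an illustration under stated assumptions, not the computation of the 1929 paper: the layout (ABCD repeated on 16 plots), the object terms and the coefficients b, c, d, e are all made up, and the yields are generated without experimental error so that the fitted differences can be checked exactly. The fit is ordinary least squares on the model of equations (2) and (3), which is what the Markoff theorem prescribes for independent errors of equal variance. The helper names `solve` and `fit_differences` are hypothetical.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(x) for x in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def fit_differences(layout, yields, degree=4):
    """Least-squares fit of: yield = object term + polynomial trend in u.

    Returns the estimated difference between each object and the first one.
    """
    objects = sorted(set(layout))
    rows = []
    for u, obj in enumerate(layout, start=1):
        indicators = [1 if obj == o else 0 for o in objects]
        trend = [u ** k for k in range(1, degree + 1)]   # u, u^2, ..., u^degree
        rows.append(indicators + trend)
    p = len(rows[0])
    # Normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, yields)) for i in range(p)]
    beta = solve(xtx, xty)
    return {o: beta[k] - beta[0] for k, o in enumerate(objects)}


# Made-up example: systematic arrangement on 16 plots, error-free yields.
layout = list("ABCD" * 4)
object_term = {"A": 50, "B": 53, "C": 48, "D": 55}
b, c, d, e = Fraction(3, 2), Fraction(-1, 4), Fraction(1, 50), Fraction(-1, 2000)
yields = [object_term[o] + b * u + c * u**2 + d * u**3 + e * u**4
          for u, o in enumerate(layout, start=1)]

diffs = fit_differences(layout, yields)
print({o: float(v) for o, v in diffs.items()})
# {'A': 0.0, 'B': 3.0, 'C': -2.0, 'D': 5.0}
```

Exact rational arithmetic is used only because powers of u up to the fourth make the normal equations badly scaled in floating point; with real yields the estimates would of course differ from the true differences by amounts governed by the experimental errors.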

If the assumptions are granted, the theory is correct. It certainly cor- 
responds more exactly to the practice of systematic experiments than the 
theory of randomized blocks does, but for a long time there was no answer 
to the question of what this correspondence meant in figures. Now some 
numerical evidence is available indicating that the theory does correspond 
to what happens in practice, at least in one particular type of systematic 
arrangement called half drill strip. 

This experimental design was invented by Dr. E. S. Beaven⁵ who used
it with great success while breeding his renowned varieties of barley. The 
half-drill-strip experiments are designed to compare only two objects, say 
two varieties, A and B. The varieties are sown in long narrow plots, half 
the drill sowing A, the other half B. The varieties are repeated in a system- 
atic order as follows. 


Sandwich I       Sandwich II      Sandwich III

A  B  B  A       A  B  B  A       A  B  B  A       ...          (4)


Four consecutive plots form what is called a sandwich: two half drill
strips with B, sown in opposite directions, are enclosed between two with 
A, also sown in opposite directions. These sandwiches obviously correspond 
to blocks, but the blocks are not randomized. 

It will be useful to distinguish between two possible methods of randomiz- 
ing the blocks of four plots to be occupied by two varieties only. One 
would be a totally unrestricted randomization, allowing arrangements like 


AABB, ABAB, ABBA, BAAB, BABA, BBAA. (5) 
4 See F. N. David and J. Neyman: “Extension of the Markoff theorem on least
squares.” Statistical Research Memoirs, Vol. II (1938), pp. 105-116.

5 E. S. Beaven: “Trials of new varieties of cereals.” Jr. of the Ministry of Agriculture,
Vol. 29 (1922), nos. 4 and 5, pp. 1-28, 436-444.

The second kind of randomizing would consist in randomizing the sandwich.
This would admit only two arrangements of the block, either ABBA
or BAAB, and the choice between them should be based on some random 
experiment such as tossing a coin. 
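
The two kinds of randomization can be made concrete with a short sketch. The function names and the seed are illustrative assumptions; the first function simply enumerates the distinct orders of two A's and two B's, and the second imitates the coin toss between ABBA and BAAB.

```python
from itertools import permutations
import random


def unrestricted_arrangements(block="AABB"):
    """All distinct orders of the letters in one block of four plots."""
    return sorted({"".join(p) for p in permutations(block)})


def randomize_sandwich(rng):
    """Restricted randomization: a coin toss chooses ABBA or BAAB."""
    return rng.choice(["ABBA", "BAAB"])


arrangements = unrestricted_arrangements()
print(arrangements)
# ['AABB', 'ABAB', 'ABBA', 'BAAB', 'BABA', 'BBAA'], the six listed in (5)

rng = random.Random(1937)                       # arbitrary seed
layout = [randomize_sandwich(rng) for _ in range(5)]
print(layout)                                   # five independently tossed sandwiches
```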

If the sandwiches are randomized as just described, and if $x_i$ denotes the
difference between the sum of the two yields of A and the two yields of B
observed on the ith sandwich, then the ordinary theory of randomized
blocks is applicable to the $x_i$. But this is not so certain with respect to a
systematic arrangement like (4). Of course, the arrangement (4) may be
treated by the method of parabolic curves described above. It is a matter 
of an easy adjustment of a few formulas and of preparing tables to facilitate 
the calculations. But here again we come to the question of whether or 
not the scheme underlying the method of parabolic curves corresponds with 
sufficient accuracy to what happens in practice. 

I shall now discuss the question of the empirical data needed for deciding 
whether or not any specified mathematical model corresponds to the
experiments. 

When comparing any two objects A and B, of which A is some established
standard, we may desire to obtain evidence that B is better than A. This
reduces to the test of the statistical hypothesis $H_0$ that the true average yield
B of B, if sown on the whole field, does not exceed that of A, say A. That is,
$H_0$ is the hypothesis that

$$B - A \le 0. \tag{6}$$

Whichever one of the mathematical schemes described is applied, the test
of $H_0$ consists (i) in calculating the estimate of $\Delta = B - A$, say $\bar{x}$, (ii) in
calculating the estimate $s^2/n$ of the variance of $\bar{x}$, and (iii) in referring the
quotient $t = \bar{x}/(s/\sqrt{n})$ to Fisher’s table of t. If the observed value of t
exceeds the tabled value $t_\alpha$, corresponding to some small value of P, say 0.05
or 0.01, then the hypothesis $H_0$ is rejected and we consider that we have
“evidence” of B being able to give average yields greater than A.

The whole question under discussion, i.e., whether or not the field trials 
must be randomized, whether or not the non-randomized trials give any sort of 
bias in the statistical tests, is reduced to the following: 

(1) Whether or not, in cases when the hypothesis tested $H_0$ is true, and, in
particular, when A = B, the value of $t = \bar{x}/(s/\sqrt{n})$ calculated by this or
that method exceeds the fixed value of $t_\alpha$ with the frequency $\alpha = P/2$
prescribed by the theory.

(2) Whether or not, in cases when the hypothesis $H_0$ is wrong and thus
$B - A = \Delta > 0$, the t test detects this circumstance, the value of t falling
above the critical $t_\alpha$ with a frequency predicted by the theory.

If, on any empirical evidence, either of the above two questions were to be
answered in the negative, then we should say that the mathematical model
that served as a basis for calculating $t = \bar{x}/(s/\sqrt{n})$ does not correspond to the
actual trials, and that either the model or the experimental design should be
altered. If, however, a considerable volume of empirical data fails to deny
either (1) or (2), then the practical man would probably say that, from a purely
academic point of view (which may be interesting by itself), there may be
disagreements between the experimental technique and its mathematical
model, but that these disagreements do not concern him. In fact, the
statistical test gives all it is expected to give; it rejects the hypothesis tested
$H_0$ when it is in fact true as frequently as expected, and it detects the falsehood
of $H_0$ when it is wrong with about the same frequency as predicted by
theory.

It is seen, therefore, that the whole question is reduced to what is the actual 
empirical distribution of values of t in cases when A = B, and in cases when
$B - A = \Delta > 0$. We must discuss the question of how such empirical
distributions can be obtained. 

It is easier to obtain an empirical distribution of t for the case when A = B
than for the case B — A > 0. We have to use for this purpose the results of 
so-called uniformity trials. Imagine a large field divided into a number of 
very small plots, considerably smaller than the ones used for actual experi- 
ments. To avoid misunderstanding, we shall call them elementary plots. 
If you treat all these plots in exactly the same way, so far as possible, and sow 
them with the same variety, you will have a uniformity trial. The results of 
such trials, represented by a plan of the experimental field with the yields of 
single elementary plots, are to be found in various publications. However, 
not all of them are equally suitable for our purpose, mainly because the ele- 
mentary plots used are not sufficiently small, or because they differ con- 
siderably from squares. If the elementary plots are very tiny squares, then 
they can be combined in various ways to form what could be real experimental 
plots. If we wish to see what the results of some particular experiment on 
this field would be, as in comparing some objects A, B, ..., which are in fact
identical (though we are not aware of it), we simply assign these hypothetical 
objects to particular plots and then perform all the calculations on the figures 
provided by the uniformity trial and apply the tests that we should apply if 
we had to deal with an actual experiment. If the elementary plots are large 
or very long, then the same procedure can be applied; but it may be hard to 
produce experimental plots of the desired size and shape. 
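
As a rough sketch of how elementary plots are combined, the following sums a small grid of elementary-plot yields into long narrow plots and then labels them with hypothetical objects. The grid values, the plot dimensions and the function name `combine_plots` are all invented for illustration.

```python
def combine_plots(field, rows_per_plot, cols_per_plot):
    """Sum elementary-plot yields into larger rectangular plots."""
    n_rows, n_cols = len(field), len(field[0])
    if n_rows % rows_per_plot or n_cols % cols_per_plot:
        raise ValueError("field does not divide evenly into plots of this size")
    combined = []
    for r in range(0, n_rows, rows_per_plot):
        combined_row = []
        for c in range(0, n_cols, cols_per_plot):
            total = sum(field[i][j]
                        for i in range(r, r + rows_per_plot)
                        for j in range(c, c + cols_per_plot))
            combined_row.append(total)
        combined.append(combined_row)
    return combined


# Made-up uniformity trial: 4 x 8 elementary plots, all sown alike.
field = [
    [5, 6, 5, 7, 6, 6, 5, 4],
    [6, 6, 4, 7, 5, 6, 5, 5],
    [5, 7, 5, 6, 6, 7, 4, 4],
    [6, 5, 5, 7, 5, 6, 5, 5],
]

# Long narrow plots: 4 elementary plots long and 1 wide, giving one row of 8.
plots = combine_plots(field, rows_per_plot=4, cols_per_plot=1)[0]
labels = list("ABBA" * 2)       # hypothetical objects, in fact identical
print(list(zip(labels, plots)))
```

Because the same variety grew everywhere, any difference that a subsequent calculation finds between "A" and "B" reflects soil variation and technical error only.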

For our purpose we should need uniformity trials with elementary plots 
that could be combined into half drill strips. Suppose that many such 
hypothetical half drill strips are available in the form of a table like the 
following, where each rectangle represents a half drill strip and the figure 
written on it the sum of the yields of the elementary plots of the uniformity 
trial of which the experimental plot is composed. They would be the actual
yields obtained on these plots in an experiment with two hypothetical but
identical varieties A and B. Writing in successive letters A, B, B, A, etc.,
on the plan of the hypothetical experiment (as shown), and applying any
given mathematical model, we can calculate t, knowing that it refers to the
case where A = B. A set of such values of t, calculated from the results
of a number of uniformity trials, will produce the distribution we want to
compare with the theoretical one deduced by Student, namely,

$$p(t) = C(1 + z^2)^{-n/2}, \tag{7}$$

where $t^2 = z^2(n - 1)$, and n − 1 is the number of degrees of freedom on
which the estimate $s^2$ is based.

If the sandwiches are randomized, then the estimate of B − A is simply
the arithmetic mean $\bar{x}$ of the numbers $x_i$ as defined above, and

$$\frac{s^2}{n} = \frac{\sum (x_i - \bar{x})^2}{n(n - 1)}. \tag{8}$$
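
A minimal sketch of this calculation follows. The yields are made up, and the sign convention is an assumption chosen for the illustration: each x_i is taken as the B total minus the A total within sandwich i, so that a positive t favours B. The function names are hypothetical.

```python
import math


def sandwich_differences(yields, pattern="ABBA"):
    """x_i = (sum of B yields) - (sum of A yields) within each sandwich."""
    size = len(pattern)
    diffs = []
    for start in range(0, len(yields) - size + 1, size):
        block = yields[start:start + size]
        total_a = sum(y for y, obj in zip(block, pattern) if obj == "A")
        total_b = sum(y for y, obj in zip(block, pattern) if obj == "B")
        diffs.append(total_b - total_a)
    return diffs


def t_statistic(x):
    """Return t = mean / sqrt(s^2 / n) and its degrees of freedom n - 1."""
    n = len(x)
    mean = sum(x) / n
    var_of_mean = sum((xi - mean) ** 2 for xi in x) / (n * (n - 1))  # formula (8)
    return mean / math.sqrt(var_of_mean), n - 1


# Made-up yields of 12 half drill strips forming three sandwiches.
yields = [22, 24, 19, 27, 22, 25, 19, 18, 21, 20, 23, 22]
x = sandwich_differences(yields)
t, df = t_statistic(x)
print(x, round(t, 4), df)
```

The value of t obtained would then be referred to the table of t with n − 1 degrees of freedom, n being the number of sandwiches.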


As far as I am aware, the first authors to run tests on uniformity trial data
to see whether or not the distribution of $\bar{x}/(s/\sqrt{n})$ from non-randomized
sandwiches followed Student’s frequency of t, were S. Barbacki and R. A.
Fisher.⁶ They came to the conclusion that the lack of randomization is destructive
to the t test, and they blamed Student for thinking differently. It
seems to me, however, that Barbacki and Fisher were a little unfair to Student,
and that the figures they produced are entirely valueless.

Barbacki and Fisher took just one uniformity trial for which weights of
yields of wheat on short parts of single rows were published.⁷ They
combined the adjoining rows to obtain the width of a half drill strip. The
rows were long and they divided them into 12 columns and so obtained
12 columns of hypothetical half drill strips, each being a continuation of
the strips in other columns. These columns were interpreted as representing
the results of six hypothetical experiments comparing some variety A
with another B. Experiment No. 1 would consist of sandwiches in columns
1 and 7; experiment No. 2 would consist of sandwiches in columns 2 and 8;
etc., as marked in the figure. The two authors calculated t for each such
experiment and were pleased to find that, in spite of the fact that the
hypothetical varieties A and B were identical, the distribution of the
empirical t was far from similar to the theoretical one. In fact, all values
of t had the same sign! This, of course, was to be expected because the
values thus calculated were not independent. It is known that the direction
of rows is frequently that of ploughing and that in this direction we frequently
observe what I call waves of fertility: if one of the plots in the
first row is better than the corresponding plot in the second, then this is
likely to be true for all other plots in these rows. These waves of fertility
are very marked on the field used by Barbacki and Fisher and consequently
the value of t calculated for any one of these hypothetical experiments
could not be much different from the one for any of the others. The whole
argument is as if we would toss a penny just once, look at it six times and,
having recorded six heads, argue that the penny must be biased. The
authors are unfair to Student because he called attention to the fact that
parts of the same strip are highly correlated.⁸

6 S. Barbacki and R. A. Fisher: “A test of the supposed precision of systematic
arrangements.” Annals of Eugenics, Vol. 7 (1936), pp. 189-193.

7 G. A. Wiebe: “Variation and correlation in grain yields among 1500 wheat nursery
plots.” J. Agric. Res., Vol. 50 (1935), pp. 331-357.

8 Student: “On testing varieties of cereals.” Biometrika, Vol. 15 (1923), pp. 271-293.
See pp. 286-287 in particular.


It follows that we cannot accept the results of Barbacki and Fisher as
conclusive in the question which interests us. Their figures emphasize only 
the known fact that there is danger in replicating an arrangement on plots 
in adjoining columns because an error in one of the columns is likely to 
be repeated in the others. This does represent an advantage for the ran- 
domized arrangements but does not show that systematic experiments, if 
carried out with due precautions, necessarily give biased results. 

There is no doubt, however, that the application of the formula (8) does 
represent a crude treatment. This was recognized by Student who, in a 
paper published in the Supplement to the Journal of the Royal Statistical 
Society, Vol. III, pp. 114-136, 1936, suggested a new way of proceeding. 
This is based on the hypothesis that the level of fertility along the row 
of drill strips is either rising or falling off more or less regularly, so that, 
within each pair of half drill strips, the fertility of the next half drill strip 
differs from that of the preceding one by a fixed quantity, which Student 
called the linear fertility slope. Again, there is no doubt that this assump- 
tion does not correspond exactly to what happens in practice, but the 
formulas that the new mathematical model involves—let it be called the 
new Student’s method—have a greater chance of giving satisfactory results 
than formula (8). In fact, this method, along with that of parabolic curves,
is based exclusively on the assumption that the experiment is arranged 
systematically. Whether or not it works well must be tested empirically. 

Some work designed to throw light on the question in which we are 
interested has been done by one of my students, Mr. C. Chandra Sekar. 
He tried to collect as many uniformity trial data as he could possibly find, 
and on each field he arranged a number of independent hypothetical ex- 
periments in systematic half drill strips. The total number of experiments 
was 120. For each experiment the value of t was calculated twice, first by 
the new Student’s method and then by the method of parabolic curves. 
The distributions obtained are shown in Figures 1 and 2. In each case the 
empirical distribution was compared with the theoretical Student’s distribution
using the smooth test⁹ for goodness of fit. The symbol $P\{\psi^2 > \psi_0^2\}$
represents the probability of obtaining by chance an agreement between
theory and observation worse than that actually observed. For the new 
Student’s method this probability is .173 and for the method of parabolic 
curves, .643. The two graphs and the two probabilities represent the 
empirical part of the inquiry. Whether the agreement between the theory 
and the observation is or is not satisfactory is a subjective question. How- 
ever, I submit that, especially as regards the method of parabolic curves, 
one could hardly expect anything better. 


9 J. Neyman: “‘Smooth test’ for goodness of fit.” Skandinavisk Aktuarietidskrift,
Vol. 20 (1937), pp. 149-199.

Figure 1

Now let us turn to the question of the effectiveness of the two methods
in cases where one of the varieties, say B, is actually better than the
other, A. In relation to this situation and on the assumption that the
observations are mutually independent and follow the normal distribution,
the theory of the t test is as follows.

(i) It has been shown¹⁰ that the superiority of B over A will be discovered
by the t test more frequently than by any other test.

(ii) The frequency of the t test failing to detect a difference $\Delta = B - A$
when it actually exists and is equal to $\rho$ times the true standard error $\sigma$ of
$\bar{x}$ is known and depends on the number of degrees of freedom on which the
estimate of $\sigma$ is based. This is what is technically called the probability of an
error of the second kind. The first short table of this kind was published by
S. Kolodziejczyk.¹¹ This was later supplemented in a joint paper by
K. Iwaszkiewicz, S. Kolodziejczyk, and myself,¹² wherein certain graphs are
published, two of which are shown on pages 79-80. Finally, a differently arranged
table was published by Miss B. Tokarska and myself.¹³

In these graphs n means the number of degrees of freedom on which the
estimate of error variance is based. Further, $\alpha$ means the fixed level of
significance with which you work. To make the diagrams clear let us
consider an example. Suppose you are arranging a randomized blocks
experiment with six treatments and three replications. In this case n = 10.
From previous experience you know that the standard error per plot is
likely to be, say, 10 percent of the average yield, and you want to know
the probability that the experiment will fail to detect as large a difference
between your treatments as 20 percent of the general mean. The expected value
of your $\sigma$ is $10\sqrt{2/3} = 8.16$. Your $\Delta = 20$, and $\rho = 20/8.16 = 2.45$. From
the diagram you find that the probability of the t test failing to detect
the difference between the treatments when it is as large as 20 percent
of the average yield is about 0.25 if $\alpha = 0.05$, and about 0.55 if $\alpha = 0.01$.
You will probably decide that the experiment planned is not sufficiently
accurate, and you will try to increase the number of replications.
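
The arithmetic of this example can be retraced, and the readings from the diagrams roughly checked, by a small simulation. The sketch assumes the ideal model (normal, independent errors) and a one-sided test; the critical values 1.812 and 2.764 are the usual one-sided 5 and 1 percent points of t for 10 degrees of freedom. The standard error of a difference between two treatment means, each based on three plots, is the per-plot error multiplied by √(2/3). The function name, the number of draws and the seed are arbitrary choices of this sketch.

```python
import math
import random


def second_kind_error(rho, df, t_crit, draws=100_000, seed=0):
    """Monte Carlo estimate of P(t <= t_crit) when the true difference is
    rho standard errors (one-sided t test, normal independent errors)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(draws):
        z = rng.gauss(rho, 1.0)                  # standardized estimate of the difference
        chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees of freedom
        t = z / math.sqrt(chi2 / df)
        if t <= t_crit:
            misses += 1
    return misses / draws


sigma = 10 * math.sqrt(2 / 3)     # standard error of the difference, percent of mean
rho = 20 / sigma                  # difference to be detected, in units of sigma

beta_05 = second_kind_error(rho, df=10, t_crit=1.812)   # alpha = 0.05
beta_01 = second_kind_error(rho, df=10, t_crit=2.764)   # alpha = 0.01
print(round(sigma, 2), round(rho, 2), round(beta_05, 2), round(beta_01, 2))
```

The simulated frequencies should fall close to the values read from the diagrams, to within the accuracy with which a graph can be read.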

10 J. Neyman and E. S. Pearson: “On the problem of the most efficient tests of statistical
hypotheses.” Phil. Trans. Royal Society, London, Vol. 231-A (1933), pp. 289-337.

11 S. Kolodziejczyk: “Sur l’erreur de la seconde catégorie dans le problème de M.
Student.” Comptes Rendus, Vol. 197 (1933), pp. 814-816.

12 K. Iwaszkiewicz, S. Kolodziejczyk and J. Neyman: “Statistical problems in agricultural
experimentation.” Supplement to Jr. Roy. Stat. Soc., Vol. 2 (1935), pp. 107-180.
See pp. 133-134 in particular.

13 J. Neyman and B. Tokarska: “Errors of the second kind in testing ‘Student’s’
hypothesis.” Jr. Am. Stat. Assoc., Vol. 31 (1936), pp. 318-326.

Figure 3. Diagram showing dependence of probabilities of second kind errors on $\rho$ and n,
when $\alpha = 0.05$

Of course, points (i) and (ii) refer to the ideal case of a complete correspondence
between the experiments and the mathematical model involving
the normal distribution and mutual independence of “errors.” Our problem
is to see whether or not the existing divergences from this model influence
the validity of the theoretical conclusions. 

Figure 4. Diagram showing dependence of probabilities of second kind errors on $\rho$ and n,
when $\alpha = 0.01$

With regard to point (i) raised above, there are insurmountable difficulties
in this respect. There is no way to produce empirical evidence that
in any fixed conditions of experimentation it is impossible to invent a test
that would be more sensitive than the t test. If any other test were suggested,
then we could produce empirical results comparing its sensitiveness
to that of t, and this comparison might show that the alternative test is
better than t. But any number of such comparisons, all of them favorable
to t, would not prove that the t test is actually the best. For this reason,
and because no test alternative to t has been suggested, we shall drop the 
question of an empirical test of question (i). 

An empirical test of point (ii) is much easier, though it requires a lot
of calculations. In fact, the problem is very similar to that dealt with in
the case where A was identical with B. We start by producing what
could be the results of actual trials in half drill strips, including the actual
inequalities in soil fertility and the actual experimental errors, in which,
however, the true average yield of B is greater by a certain amount than
that of A. For each such experiment we calculate the value of t and see
how frequently it fails to exceed the critical tabled value of t, that is to
say, how frequently the t test fails to detect the advantage of B over A.
This frequency must then be compared with the probability of an error 
of the second kind to be found in the tables mentioned above or read from 
the graphs on pages 79-80. 

In order to produce the quasi-empirical data for the above purpose we 
use again the same uniformity trials that were used before. I have men- 
tioned on page 73 that on each of the fields with uniformity trials it is 
possible to arrange more than one hypothetical experiment in half drill 
strips. Each of them gives an estimate of the error variance. Several such 
estimates were averaged, and this average was taken as the true value of 
the error variance for the experiments on any particular field. 

To see more clearly what was done next, consider the situation on any two
particular fields. The assumed true standard deviations of the estimates of
B − A on those fields are $\sigma_1$ and $\sigma_2$, respectively. Using the graphs of probabilities
on pages 79-80, the values $\rho(20)$, $\rho(40)$, $\rho(60)$, and $\rho(80)$ of $\rho$ were
found, for which the probabilities of errors of the second kind are 0.20, 0.40,
0.60, and 0.80. These values of $\rho$ were then multiplied by $\sigma_1$ and $\sigma_2$ to obtain
what I shall denote by $\Delta_1(20)$, $\Delta_2(20)$, $\Delta_1(40)$, etc., so that, for example,

$$\Delta_1(20) = \sigma_1\rho(20), \quad \Delta_2(20) = \sigma_2\rho(20), \quad \text{etc.}$$

You will notice that $\Delta_1(20)$ represents the value such that if the difference
between B and A tested on the first field were equal to $\Delta_1(20)$, then the
theoretical probability of the t test failing to detect the advantage of B over
A would be exactly equal to 0.20.

Suppose that the values of $\Delta_i(20)$, $\Delta_i(40)$, $\Delta_i(60)$, and $\Delta_i(80)$ are calculated
for the ith field. Take one of the hypothetical experiments in the systematic
half drill strips previously arranged on some particular field from data of
uniformity trials, and add $\Delta_i(20)$ to all the hypothetical yields of the object
B. Before this addition, the variability of yields from plot to plot was due
solely to soil variation and technical errors, since all the plots were equally
treated and sown with the same variety. After the addition of $\Delta_i(20)$ to
the yield of the hypothetical B, we obtain what could be the result of an actual
trial of A and B, including the effect of soil variation and technical errors,
A and B having the property that whatever the true yield of A, the true yield
of B is greater by the amount $\Delta_i(20)$. That is what we want for testing the
distribution of t when $B - A = \Delta_i(20)$.
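
The construction just described amounts to very little computation, as the following sketch shows. The yields are invented, and both the standard error `sigma_i` and the multiplier `rho_20` are placeholder numbers, not values taken from the diagrams or from the actual uniformity trials; the function names are hypothetical as well.

```python
def shift_b_yields(yields, labels, delta):
    """Return a copy of the yields with delta added on every plot labelled B."""
    return [y + delta if label == "B" else y
            for y, label in zip(yields, labels)]


def mean_difference(yields, labels):
    """Mean yield of the B plots minus mean yield of the A plots."""
    a = [y for y, label in zip(yields, labels) if label == "A"]
    b = [y for y, label in zip(yields, labels) if label == "B"]
    return sum(b) / len(b) - sum(a) / len(a)


uniformity_yields = [22, 24, 19, 27, 22, 25, 19, 18]   # made-up half drill strips
labels = list("ABBA" * 2)                              # systematic sandwiches

sigma_i = 2.0     # placeholder: assumed true standard error on field i
rho_20 = 2.9      # placeholder: multiplier giving a 0.20 second-kind error
delta_20 = sigma_i * rho_20

shifted = shift_b_yields(uniformity_yields, labels, delta_20)
before = mean_difference(uniformity_yields, labels)
after = mean_difference(shifted, labels)
print(before, after)   # the difference rises by exactly delta_20
```

The soil variation and technical errors of the uniformity trial are carried over unchanged; only the true advantage of B is imposed.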

Mr. C. Chandra Sekar calculated t for each of the experiments in such
systematic sandwiches, obtained in the above way from the data of uniformity
trials. Again, both the new Student’s method and the method of
parabolic curves were tried. The results, in the form of frequencies of
non-detection of the advantage of B over A, both observed and theoretical,
are set up in the following table.


TABLE I

Relative frequencies of failure to detect a real advantage of B over A in systematic
half-drill-strip experiments

Theory,        Method of parabolic       New Student’s
percent        curves, percent           method, percent

  20                 23.3                     27.5
  40                 40.8                     46.7
  60                 62.5                     61.7
  80                 78.3                     75.8

Again, this is the objective part of the answer to the question of whether
or not the lack of randomization ruins the t test. The first column gives
the theoretical frequency of cases in which the t test should fail to detect
the advantage of B over A. The other columns show what these frequencies
would be in a number of experiments in which the variability of the
soil and the experimental errors are exactly as they were in actual uniformity
trials. Is the disagreement sufficient to say that the t test is of no
use when applied to the systematic half drill strips? This, as I said, is a
personal question. So far as I am concerned, the agreement between the
theory and the empirical results seems to be satisfactory. Especially in
the case of parabolic curves, the t test both detects the advantage of B
when this advantage exists and suggests its existence when it does not
exist with relative frequencies very much the same as indicated by the
theory.

In consequence, I do not see any evidence to support the assertion that 
lack of randomization by itself is ruinous to statistical tests. We must, 
however, remember the following points. 

(i) The above empirical results refer to one particular systematic arrangement
in half drill strips: ABBA, etc. It is reasonable that if we take any
other systematic arrangement, the conclusions suggested by the empirical
results may be different. If we take the systematic arrangement of blocks
with more than two objects

ABCD, ABCD, ...,

then probably the advantage of the method of parabolic curves over the
ordinary formulas for randomized blocks will be more marked than in the 
case of half drill strips, but this requires an empirical test. 

(ii) The waves of fertility are an important feature that should be borne 
in mind in any case and especially when the trials are arranged system- 
atically. Whenever I was able to ascertain the direction of ploughing, I 
found that the fertility seems to stay steadier along the direction of plough- 
ing than across. It seems to me that the direction of ploughing may be 
the real cause of these waves, but I have no definite evidence of this. 
Sometimes the waves are difficult to detect when you simply look at the 
uniformity trial data. In other instances they are very pronounced. The 
following table gives a part of the uniformity trial data with rye as 
described by Hansen.¹⁴ Looking at it you will hardly believe that all the
plots were sown with the same variety and equally treated, but this is a fact. 


TABLE II
Hansen. Yields of rye. Uniformity trial data, 1909 





Probable direction of ploughing 





Imagine now that, without knowing the peculiar fertility level of the 
field, you use this field for an actual experiment and cut your plots along 
the columns. The results would be deplorable. On the other hand, if 
long and narrow plots were cut across the columns, the experiment might 
have been fairly successful. 

14 N. A. Hansen: “Prøvedyrkning paa Forsøgsstationen ved Aarslev.” Tidsskrift for
Planteavl, Vol. 21 (1914), pp. 553-617.

If practical circumstances forced one to cut the plots along the columns
of the above, say four rows deep, so that out of each column we had two
plots, then it would be most inadvisable to arrange a systematic experiment
replicated exactly in the two rows, e.g.,

ABCD, ABCD, ...
ABCD, ABCD, ...


since the second row would repeat almost identically the same soil errors 
as there are in the first. In such circumstances, a randomized arrangement 
would be most useful. In this sense, the randomized arrangements do have 
definite advantages over the systematic ones. 

Turning to the question of the waves of fertility, I think that from the 
point of view of accuracy of agricultural trials it would be most useful to 
have some indication of their cause. Probably it would not be too difficult 
to make a special experiment to discover whether the direction of the waves 
of fertility is actually connected with that of ploughing. 


Part 2. On Certain Problems of Plant Breeding 


(The contents of this lecture are based on a conference held in Room 4090 of the Depart- 
ment of Agriculture, April 7, 1937, 10 a.m., Dr. S. C. Salmon presiding.) 


The problem I am going to discuss in this conference is a specific one 
connected with the breeding of new varieties of sugar beet. However, I 
believe that it is of wider interest than its restricted nature would indicate. 
Aside from the fact that similar problems arise in breeding other plants, 
there is another and a stronger reason for my choice of this particular 
subject. The point I want to illustrate is this: the methods of mathe- 
matical statistics may be useful not only in treating isolated trials as, for 
example, those discussed in the preceding conference but also in forming 
the over-all policy of an organization. The particular organization about 
which I will speak is a sugar beet breeding establishment, but it can be 
seen that problems of a similar kind will arise elsewhere. 

The idea of the problem originated from contact with sugar beet breeders 
in Poland. However, the results that I am going to present are due to
Mrs. Y. Tang, M. Sc., and all of the details are published in her paper
prepared at the Department of Statistics, University College, London.¹

The process of breeding new varieties of sugar beets is fairly compli- 
cated, but a rough idea of its essence can be obtained from the diagram 
on page 85 which represents schematically five distinct steps. In con- 
sidering these steps, we must remember several important points concerning 
sugar beets. The first is that the sugar beet is a two year plant. During 
the first vegetative season a seedling produces a plant with a big root 
containing a considerable amount of sugar but yielding no seeds. The 
seeds are produced in the course of a second vegetative season when the
plant uses the food previously accumulated in its roots in the form of
sugar. The second important point consists in the fact that the sugar beet
is a cross fertilizing plant, and that this makes it extremely difficult, if
not impossible, to produce anything like a pure line. Finally we must
remember that we may have various aims in the production of new varieties:
we may try to produce (i) beets with the highest sugar content, (ii) beets
with the highest yield of roots per acre, or (iii) beets with the highest
yield of sugar per acre. The discussion which follows applies to all three
cases, but we shall consider only the first.

1 Y. Tang: “Certain statistical problems arising in plant breeding.” Biometrika, Vol.
30 (1938), pp. 29-56.

FIGURE 1
Plant breeding: scheme of production of new varieties of sugar beets

Keeping these points in mind, let us consider the diagram and see what 
are the five consecutive steps leading to new varieties. The first step con- 
sists in choosing from the existing varieties a number of roots which, for 
various reasons, seem to be promising, and in forcing them to cross fertilize. 
For this purpose the roots are planted in pairs on plots isolated from one 
another in a larger field of some cereal. The hope is that the capacity of 
producing high sugar content in old varieties may be increased as a result 
of crosses between them. But it is clear that a cross must sometimes 
increase the capacity of producing a low sugar content. Therefore not all 
of the progeny of the crosses are suitable for further breeding, and we 
have to perform a selection. 

All the seeds produced by the crosses are sown on a larger plot and 
produce roots. These form the material for what is called “individual
selection,” the second step in our scheme. At the end of the vegetative 
period all the roots are lifted, washed, and weighed. A small portion is 
cut from each root and analyzed for sugar content. This cutting neither 
kills the root nor affects its ability to produce seeds as well as if it had 
been left intact. The majority of roots analyzed are discarded as unsatis- 
factory. The remaining ones, having the highest sugar content or certain 
morphological characteristics indicating that they may be able to produce 
high sugar content, are stored for the winter. Then, in the spring they are 
planted separately on isolated plots to produce seeds, mostly from
self-fertilization. This is the third step in our scheme. Each of the selected
roots is called a parent plant, and originates a new variety. 

Obviously each parent plant is able to produce only a very limited 
amount of seed. Therefore, two or more vegetative seasons must be used 
to multiply the seeds of the new varieties, and this is described in the 
diagram as step IV. 

The fifth and last step consists in determining which of the newly bred
varieties possess an advantage in sugar content over some established
standard. We must remember that the sugar content of any individual root
depends not only on the genetical composition of the plant but also,
frequently to a greater extent, on various conditions of environment.
Consequently the sweetest of the parent plants selected in step II do not
necessarily produce the varieties with the highest sugar content. Also it is
possible that still sweeter varieties might have been produced by some of the
roots grown in step II that, owing to uncontrollable variation of environment,
had small sugar content and were discarded. The field trials (step V)
are meant to eliminate the individual variability of sugar content in roots
of a new variety. We may put it otherwise: analyses in V are a comparison
of varieties, wherein the properties of individual roots are more or
less ignored.

Needless to say, along with the field trials in step V we continue to 
multiply the seeds of the new varieties, and the final decision as to whether 
or not any one of them is a success is made, not after one year, but after 
several years’ trials. However, these are details. 

In any event, after the fifth step is concluded, the breeder has to decide 
which of the new varieties are suitable to put on the market. Other 
families of beets are discarded as failures. 

I must call your attention to certain consequences of the fact that the 
sugar beet is a cross fertilizing plant and consequently that any single 
individual is heterozygous with respect to a number of pairs of genes. One 
consequence is that a plant which is called a “new variety” does not repre- 
sent anything stable, but changes from generation to generation. 

Further, according to a law discovered by Galton and which is a conse- 
quence of the Mendelian laws, the change is unfavorable to the breeder: 
there is necessarily a regression (i.e., a set-back) in sugar content. This 
makes it impossible for the breeder to find just one or two exceedingly 
sweet varieties and keep them for reproduction from year to year without 
further selection. After a relatively short period, the sugar content of 
new generations will drop and the breeder will lose his market. Conse- 
quently, each breeder has to repeat constantly the steps described above, 
perhaps with certain modifications, and to start step I each year, mean- 
while continuing the following steps applied to varieties planted in previous 
years. 

Another consequence of the instability of the varieties is the instability 
of the standard variety, with which the new varieties are compared in 
step V. As each variety changes necessarily from year to year, so must 
the standard change, even if it bears the same label. 

In Poland it is usual to take as standard that variety which in the pre- 
ceding year proved to be the sweetest. The beet sugar industry arranges 
each year competitive experiments with a number of varieties, produced 
by several leading firms. These experiments, carried out in a number of 
places in all the beet growing districts of Poland, are made according to 
a certain fixed method, with the same number of replications, etc. 

After this somewhat lengthy preliminary, we may turn to the problems
which the breeder must face in deciding on details of his work. These 
problems are statistical in character and refer to steps II and V. Their 
aim is to see how the breeder is likely to increase his chances of success. 
We must now review some of the possible causes of his being unsuccessful. 

1. The breeder may be unlucky in choosing plants for his crosses in I. 
But this is not a statistical problem. 

2. Supposing that the breeder was successful in I, he may be unlucky 
in II by failing to select for further breeding the roots that have the best 
genetical properties. This is a problem that is partly botanical and partly 
statistical. The statistician may advise the breeder to select for further 
breeding as many parent plants as he possibly can, so as not to omit the 
best ones. I shall call this advice A. 

3. Suppose now that the breeder was successful in both steps I and II,
and, consequently, that some of his new varieties that come for comparison 
with the standard in V are better than the standard. Obviously, again he 
may be unlucky and lose these new varieties. The accuracy of field trials 
is known to be limited and it is just possible that through unavoidable 
errors the experiments will fail to detect the goodness of the best varieties, 
so that eventually they will be discarded. This, of course, would be most 
unfortunate, since it would mean a total waste of a considerable amount 
of effort, money, and time. Here again is a problem for the statistician 
and he will give what I shall call advice B: make your experiments as 
accurate as possible; if you cannot improve the method of experimentation,
then increase the number of replications. 

Both advice A and B are sound, of course, but both will seem very 
troublesome to the practical breeder. His means are always more or less 
limited and, before all, this applies to the arable area at his disposal. You 
will notice that each of the advices A and B makes a claim on this area 
and the breeder is faced with the dilemma: to select more roots in step II 
and then make fewer replications in the comparative trials in step V, or 
to select fewer roots, to start fewer varieties each year, and then to com- 
pare them with the standard, using many replications. If he selects too 
few roots in step II, he is likely to have poor material from which to 
choose and he may be unsuccessful even though his trials in step V are very 
accurate. If he starts a great many new varieties in step II, his chances 
of having some good ones are high, but if the trials in step V have few 
replications, the best new varieties may go undetected. 

The decision as to the number of new varieties to be started each year 
and as to the number of replications in the comparative trials to be made 
is just the matter of the breeder’s general policy which I want to discuss. 
This is not a problem strictly limited to plant breeding. Its generality may 
be judged from the following example taken out of an entirely different
field. Imagine a squadron of twelve bombers facing twelve ships of an 
invader. Each of the twelve bombers may be directed to attack a separate 
ship. The argument is that, given good luck, all of the twelve ships may 
be sunk. On the other hand, the twelve bombers may be directed to 
attack, say, three selected ships, four planes to a ship. In this case, no 
more than three ships can be sunk, but it is obvious that the chances of 
sinking at least one ship are much better. The problem of the right dis- 
tribution of the attacking air force among available targets of equal priority 
is essentially the same problem of general policy which is faced by the 
plant breeder. 

The policy problem of the plant breeder was dealt with by Mrs. Tang
with particular reference to sugar beets. Her results show how to calculate 
approximately the results of plant breeding for any given ratio of the 
number of new varieties and the number of replications used. Of course, 
the final appreciation of the results of such calculations must depend on 
many local conditions. 

It is interesting to note that the solutions of the above problem, advanced 
by practical breeders, most probably on intuitive grounds, differ enormously. 
The number of new families of sugar beets started yearly by Polish breeders 
goes into hundreds, while the number of replications they use is sometimes 
as small as four, and to my knowledge, has never exceeded sixteen. On 
the other hand, the breeders of barley in England and Ireland start with
only four or perhaps five new families and then test them in 40 half drill
strips! It is entirely possible that this difference is due to special
characteristics of the two particular plants and also to the cost of land,
labor, etc.
But it is possible also that the general intuition of the practical worker 
was, in one case or in the other, misled. 

Now I must recall the nature of the errors that may be committed when
testing statistical hypotheses. In doing so, I will treat the particular case
of the comparison between a new variety V and the standard S. Denote by
V and S the true average sugar content that the two varieties would yield if
each were sown on the entire experimental field and if there were no technical
errors. We are interested in the difference

\[ \Delta = V - S, \tag{1} \]

which may be termed the true sugar excess of the variety V over the standard
or, for short, the sugar excess. If Δ is positive, the new variety will be
considered satisfactory. Otherwise it will be a failure. The experiment does
not give us the true value of Δ but only the estimate x of Δ which is always
affected by a positive or negative experimental error ε, so that

\[ x = \Delta + \varepsilon. \tag{2} \]

Before the variety V is placed on the market, the breeder wants to have
some “evidence” that it is satisfactory, i.e. that Δ (not x) is positive. He
must be particular on this point, for otherwise he will frequently have
inferior goods and lose his customers. In this instance, mathematical
statistics is helpful and provides means by which the frequency of cases when
Δ is judged positive without, in actual fact, being positive can be reduced
to any low level α chosen in advance. The number α is called the level of
significance.

Statistically, the problem of the breeder is reduced to testing the
hypothesis H₀ that

\[ V - S = \Delta \le 0. \tag{3} \]

If, as a result of our test, we decide to reject the hypothesis H₀, this is
equivalent to a recognition that we have “evidence” of Δ being positive,
i.e. of the new variety being better than the standard.

The test of the hypothesis H₀ consists in the rule of rejecting H₀ whenever

\[ \frac{x}{s} > t_\alpha, \tag{4} \]

where s is the estimate of the standard error of x and t_α is a constant number
taken from Fisher’s tables corresponding to the number of degrees of freedom
on which the estimate s is based and to the tabled P = 2α. This test was
originated by Student.

The properties of this test are: (i) whenever the new variety is barely
as good as the standard, i.e. when Δ = 0, the hypothesis tested will be
rejected (this is equivalent to placing an unsatisfactory variety on the
market) with a relative frequency equal to α; (ii) whenever H₀ is true and
the new variety is worse than the standard, i.e. when Δ < 0, the relative
frequency of rejection will be even smaller than α; (iii) whenever H₀ is
wrong and the new variety is superior to the standard, i.e. when Δ > 0,
then the above test will detect this circumstance more frequently than
any other imaginable test having properties (i) and (ii).²

We must be clear on this point and, therefore, let us consider some
numerical illustrations. One breeder A may desire that the proportion
of his unsuccessfully bred varieties which reach the market should not
exceed 5 percent. In this case, the level of significance being α = 0.05,
he finds in Fisher’s Table IV the value of t corresponding to P = 2α = 0.10.
If the number of degrees of freedom is 12, then t = 1.782. Thus he will
reject the hypothesis H₀ and say that his variety is good enough to be
put on the market when x > 1.782s. Another breeder B may consider
that to allow 5 percent of his unsatisfactory varieties to go on the market
is too great a risk; he may consider that the proportion of such varieties
should not exceed 1 percent. In such a case, he would put α = 0.01 and
select t corresponding to P = 2α = 0.02. On this basis he would let through
his new variety only if x > 2.681s. Other breeders may be even more
cautious.

2 J. Neyman and E. S. Pearson: “On the problem of the most efficient tests of
statistical hypotheses.” Phil. Trans. Roy. Soc., London, Vol. 231-A (1933), pp. 289-337.
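
The rule applied by the two breeders can be written out as a short sketch. The critical values are those quoted above for 12 degrees of freedom; the function name `passes_test` and the trial figures (an estimated excess of 0.9 with a standard error of 0.4) are hypothetical and serve only as an illustration.

```python
# Critical values of Student's t for 12 degrees of freedom, as quoted in the
# text from Fisher's Table IV (tabled P = 2 * alpha, i.e. a one-sided test).
T_CRITICAL_12_DF = {0.05: 1.782, 0.01: 2.681}


def passes_test(x, s, alpha):
    """Return True when H0 (true excess <= 0) is rejected, i.e. x/s > t_alpha.

    x     -- estimated sugar excess of the new variety over the standard
    s     -- estimated standard error of x (12 degrees of freedom assumed)
    alpha -- level of significance; only 0.05 and 0.01 are tabulated here
    """
    return x / s > T_CRITICAL_12_DF[alpha]


# Hypothetical trial result, not taken from the text.
x, s = 0.9, 0.4
for alpha in (0.05, 0.01):
    verdict = "put on the market" if passes_test(x, s, alpha) else "withhold"
    print(f"alpha = {alpha}: x/s = {x / s:.3f} -> {verdict}")
```

With these figures the ratio x/s is 2.25, so the same variety would be let through by breeder A and withheld by breeder B.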

QUESTION BY DR. SARLE: Is there any danger of being too cautious?

ANSWER: Yes, there is, and I am most grateful for the question. The
danger consists in that, whenever we are too particular in trying to avoid 
unjust rejections of the hypothesis tested, i.e. rejection when it is in fact 
true, then we are exposing ourselves to an increased risk of failing to detect 
cases when the hypothesis is false. This problem is sufficiently important 
to justify a little digression. 

It will be convenient to use the special terminology introduced in Chapter I,
Part 3, to distinguish between the two kinds of error that we may make when
testing a statistical hypothesis and, in particular, when judging whether a
given variety is or is not better than the standard. If, as a result of a test,
we reject a hypothesis when in fact it is true, we say that the error committed
is of the first kind. Thus, when the breeder puts on the market a variety that
does not exceed the standard, he commits an error of the first kind. On the
other hand, an error of the second kind consists in accepting the hypothesis
tested when in fact it is false. Thus, when the breeder does not find sufficient
reason for judging his variety satisfactory (i.e. when x/s < t_α), whereas his
new variety is actually sweeter than the standard (i.e. Δ > 0, though he does
not know it), he commits an error of the second kind.

Errors of the first kind are dangerous to the trade of the breeder, but then 
so are errors of the second kind. It must be remembered that each rejec- 
tion of a satisfactory variety means a complete waste of effort and money 
spent for a substantial number of years: after all the years of work a 
variety exceeding the standard in sugar content is successfully produced 
and then an error of the second kind causes this variety to be discarded. 
Thus it is necessary to have as clear an idea as possible regarding the 
chance of committing an error of the second kind. Numerical evaluations
of the probabilities of errors of the second kind are based on charts
reproduced in the preceding chapter.

In the present notation, the “standardized” error of the second kind is

\[ \rho = \frac{\Delta}{\sigma}. \tag{5} \]

This is the true value of Δ divided by the true value of σ, where σ is the
true standard error of x (not the estimate s of σ).

To illustrate the use of the diagrams in answering the question raised
by Dr. Sarle, we suppose that the arrangement contemplated for a future
experiment is in randomized blocks with three varieties and six replications
which gives n = 10 degrees of freedom. We suppose further that previous
experience indicates that σ may be taken as something like 0.5. In these
circumstances, let us see what would be the chance of detecting that a
particular variety is better than the standard when Δ is actually positive
and as large as 1 percent. In order to answer this question, we calculate
ρ = Δ/σ = 2 and refer to the curves corresponding to n = 10 on pages 79-80.
We see that if we use the level of significance α = 0.05 (Figure 3,
page 79), the probability of an error of the second kind is about 0.42. On
the other hand, if α = 0.01 (Figure 4, page 80), the probability of this is
0.65. This means that, if the true value of the mean excess is as large
as 1 percent and if we use alternatively α = 0.05 and α = 0.01, then in
the circumstances of the experiments the mere existence of the advantage
of a new variety over the standard will be detected in only about 58 or 35
cases, respectively, out of a hundred. From this you can see how the
excess of caution with respect to errors of the first kind (0.01 in place of
0.05) leads to an increased chance of committing errors of the second kind
(65 out of 100 in place of 42 out of 100).
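
The same kind of question can be put to a direct computation instead of to the charts. The sketch below is only an illustration under stated assumptions: it takes the ratio x/s to be distributed as (Z + ρ)/√(W/n), with Z a standard normal variate and W an independent chi-square variate with n degrees of freedom, and integrates numerically over W. The function names and the critical values 1.812 and 2.764 (the usual one-sided tabled values for n = 10) are not taken from the text, and the results need not coincide exactly with figures read off printed charts.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal law."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def chi2_pdf(w, n):
    """Density of chi-square with n degrees of freedom at a point w > 0."""
    log_pdf = ((n / 2 - 1) * math.log(w) - w / 2
               - (n / 2) * math.log(2.0) - math.lgamma(n / 2))
    return math.exp(log_pdf)


def power_one_sided_t(rho, n, t_crit, steps=20000):
    """Probability that x/s exceeds t_crit when the true Delta/sigma equals rho.

    Conditionally on W = w the event is Z > t_crit * sqrt(w/n) - rho, so the
    power is the chi-square average of a normal probability (midpoint rule).
    """
    upper = n + 15.0 * math.sqrt(2.0 * n)  # chi-square mass beyond this is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * h
        total += chi2_pdf(w, n) * normal_cdf(rho - t_crit * math.sqrt(w / n))
    return total * h


# One-sided critical values of t for n = 10 (standard table values, assumed here).
T_CRIT_10_DF = {0.05: 1.812, 0.01: 2.764}

for alpha, t_crit in T_CRIT_10_DF.items():
    power = power_one_sided_t(rho=2.0, n=10, t_crit=t_crit)
    print(f"alpha = {alpha}: chance of detection = {power:.2f}, "
          f"error of the second kind = {1.0 - power:.2f}")
```

Setting `rho=0.0` returns the level of significance itself, which gives a simple check on the integration.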

Returning to the main subject of the conference, we notice that the
graphs describing the dependence of the probability of errors of the second
kind on the values of ρ and n are relevant from the point of view of the
problems in plant breeding which we are considering. In practice, after a
few years of existence, any seed breeding establishment must be aware of
the size of the standard error per plot, say σ₀, which is likely to hold in
future experiments. It is impossible to predict the exact value of σ₀, but
it is certainly possible to make rough estimates of its upper limit.
Therefore the breeder who contemplates experiments with m replications is
able to substitute some reasonable number for σ into the expression for
ρ = Δ/σ, taking

\[ \sigma = \sigma_0 \sqrt{\frac{2}{m}}. \tag{6} \]

He may then use tables or graphs of probabilities of errors of the second
kind to find out what approximately will be his chance of detecting the
advantage of his varieties when Δ = V − S has any value in which he may be
interested. If he finds that, given a certain value of m, this chance is too
small, then he will consider increasing the number m of replications. The
increase of m will decrease the value of σ, increase the value of ρ, and
consequently decrease the probability of an error of the second kind, i.e. the
probability of failing to detect a good variety. This procedure must be
considered essential in any rational planning of experiments.
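
The effect of the number of replications on ρ can be tabulated in a few lines. The sketch assumes that σ is the standard error of a difference between two means of m plots each, σ = σ₀√(2/m), with independent plot errors. The planning figures (an error per plot of 0.8 percent and a true excess of 0.5 percent) and the function names are hypothetical.

```python
import math


def standard_error_of_excess(sigma_0, m):
    """Standard error of a difference between two means of m plots each.

    Assumes independent plot errors with a common standard error sigma_0.
    """
    return sigma_0 * math.sqrt(2.0 / m)


def standardized_excess(delta, sigma_0, m):
    """rho = Delta / sigma for a trial with m replications."""
    return delta / standard_error_of_excess(sigma_0, m)


# Hypothetical planning values, not taken from the text.
sigma_0, delta = 0.8, 0.5
for m in (4, 5, 8, 16, 40):
    sigma = standard_error_of_excess(sigma_0, m)
    rho = standardized_excess(delta, sigma_0, m)
    print(f"m = {m:2d}: sigma = {sigma:.3f}, rho = {rho:.2f}")
```

Each value of ρ obtained in this way would then be carried to the tables or graphs for the corresponding number of degrees of freedom.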

But in the case of the plant breeder a special difficulty arises. Suppose he
finds that, with five replications and α = 0.05, the probability of detecting a
good variety for which V exceeds the standard S by 5 percent is fairly large,
say 0.9. It will be seen that this result is not very helpful. In fact, it is
difficult to say beforehand how frequently his steps I through IV (page 85)
will yield him new varieties which exceed the standard in sugar content by
as much as 5 percent. It is possible that such success in breeding is
unthinkable and that usually Δ does not exceed, say, one-third of a percent.

If one looks at the above mentioned graphs, it is easy to find that in 
such a case the chance of the breeder detecting the goodness of any of his 
varieties will be very small. Thus, if he keeps arranging his experiments 
with only m = 5 replications, practically all of his efforts in breeding new 
varieties will be wasted. 

It is seen that the solution of the breeder’s problem requires knowledge,
not only of the probabilities of errors of the second kind, but also of the
distribution of Δ in the population of new varieties which the breeder is
likely to obtain in the future. It is impossible to predict what will happen
in the future but it is possible to make rough guesses by studying what has
happened in similar circumstances in the past. We may try to estimate
the distribution of Δ in past years and use these estimates to obtain an
idea of what may happen in the future.

The problem may be stated as follows. In some particular year, M
experiments, comparing a large number N of new varieties with the same
standard, gave N estimates, x₁, x₂, ···, x_N, of sugar excesses corresponding
to the N varieties and M estimates, s₁, s₂, ···, s_M, of standard errors
corresponding to the M experiments. It is required to use these numbers to
estimate the distribution, say p(Δ), of the true excesses Δ₁, Δ₂, ···, Δ_N, of
the new varieties.

A similar problem was considered previously by Eddington and the
solution is quoted by Levy and Roth.³ However, Mrs. Tang offers a new
approach. Her method consists of the following.

Denote by μₖ and νₖ the kth moments about zero of Δ and x respectively,
and by σ² the variance of the experimental error ε in the observations x.
If the traditional assumption is made that ε is normally distributed, then,
as Mrs. Tang has calculated,

\[
\begin{aligned}
\mu_1 &= \nu_1, \\
\mu_2 &= \nu_2 - \sigma^2, \\
\mu_3 &= \nu_3 - 3\sigma^2 \nu_1, \\
\mu_4 &= \nu_4 - 6\sigma^2 \nu_2 + 3\sigma^4.
\end{aligned}
\tag{7}
\]

3 H. Levy and L. Roth, Elements of Probability. Oxford University Press, 1936,
200 pp.

Mrs. Tang also uses the assumption that σ has the same value in all M
experiments. The use of this second assumption is partly justified by the
fact that all the experiments are carried out by the same staff on the
same large field with varieties which have many similar properties. The
common value of σ can be estimated then with great accuracy since the
estimate will be based on hundreds of degrees of freedom. This estimate,
s′, may be substituted in (7) for σ. Next, the observed values of x can
be used to estimate the moments ν₁, ν₂, ν₃, ν₄. Together with s′, they will
yield the estimates of μ₁, μ₂, μ₃, μ₄. Finally, having obtained the μ’s, Mrs.
Tang uses them to fit a Pearson curve which is considered to be an estimate
of p(Δ).
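
The moment step of this procedure can be sketched as follows. The relations used are those that follow from x = Δ + ε when ε is normal with zero mean and variance σ² and is independent of Δ; the function names are hypothetical, and the final step of fitting a Pearson curve to the four moments is not attempted here.

```python
def raw_moments(values, order=4):
    """First `order` sample moments about zero of the observed values."""
    n = len(values)
    return [sum(v ** k for v in values) / n for k in range(1, order + 1)]


def moments_of_true_excess(x_values, sigma):
    """Estimate the moments about zero of Delta from those of x.

    x_values -- observed sugar excesses, one per variety
    sigma    -- common standard error of the observations (the pooled s')
    """
    nu1, nu2, nu3, nu4 = raw_moments(x_values)
    var = sigma ** 2
    mu1 = nu1
    mu2 = nu2 - var
    mu3 = nu3 - 3.0 * var * nu1
    mu4 = nu4 - 6.0 * var * nu2 + 3.0 * var ** 2
    return mu1, mu2, mu3, mu4


# Tiny made-up sample, used only to show the arithmetic.
mu = moments_of_true_excess([1.0, 2.0, 3.0], sigma=1.0)
print([round(m, 4) for m in mu])
```

With a small sample and a large σ the estimated second moment can fall below the square of the first, so that in practice many varieties and a well determined s′ are needed before the moments are used for curve fitting.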

It is difficult to test the efficiency of this method theoretically. However,
Mrs. Tang tried an empirical test. She started with an arbitrarily selected
distribution represented by the histogram in Figure 2 (shown below). She
considered the histogram as the true distribution of N values of Δ in some
possible two experiments. Next she used Mahalanobis’ table⁴ of normal
deviates to produce values of x such as experiments with N new varieties,
when her assumptions are satisfied, might have produced. In a similar
way she obtained M values of the estimate of the error variance, each
corresponding to one hypothetical experiment. After she had obtained
these quasi empirical figures, she applied her method to estimate the
distributions of Δ and x. Figure 2 shows the results. It is seen that the
continuous curves do agree with the “true distribution” represented by the
histogram.

FIGURE 2
Histogram: True distribution of Δ
Curves: Estimated distribution of Δ; estimated distribution of x (broken line)

QUESTION BY DR. SARLE: I am wondering what you used for a check.

ANSWER: I will explain it again. Let us assume that the true distribution
of Δ is as follows:


Value of Δ    −6   −5   −4   −3   −2   −1    0    1    …    …
Frequency      1    3    5    …   12    …    4    …   12    …


It is seen here that one of the Δ’s is equal to −6, three others are equal
to −5, etc. Write down the Δ in one column, thus:

\[
\begin{aligned}
\Delta_1 &= -6, \\
\Delta_2 &= -5, \\
&\;\;\vdots \\
\Delta_5 &= -4, \\
&\text{etc.}
\end{aligned}
\tag{8}
\]


Next take from the table of Mahalanobis the corresponding number of
values of ε; these values are so tabulated that they may be considered as
values of a normal variate about zero with unit standard deviation.
Suppose that you find

\[
\begin{aligned}
\varepsilon_1 &= 0.03, \\
\varepsilon_2 &= -1.16, \\
\varepsilon_3 &= -0.25, \\
\varepsilon_4 &= 0.53, \\
&\text{etc.}
\end{aligned}
\tag{9}
\]

4 P. C. Mahalanobis: “Tables of random samples from a normal population.” Sankhyā,
Vol. 1 (1934), pp. 289-328.

Now add these numbers to your Δᵢ and you will obtain values which
might be given by experiments if the true σ were unity and if the true Δ
were distributed according to the above table. The results,

\[
\begin{aligned}
x_1 &= -6 + 0.03 = -5.97, \\
x_2 &= -5 - 1.16 = -6.16, \\
x_3 &= -5 - 0.25 = -5.25, \\
x_4 &= -5 + 0.53 = -4.47, \\
&\text{etc.}
\end{aligned}
\tag{10}
\]

may now be used to estimate the distribution of x by the method of Mrs.
Tang. Figure 2 represents the results.
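
A check of the same kind can be run with pseudo-random normal deviates in place of a printed table. The sketch below is an illustration only: the frequency table is hypothetical (it is not the one used by Mrs. Tang, and it is made large so that the comparison is stable), the seed is arbitrary, and only the first two moments are compared.

```python
import random


def simulate_trials(frequency_table, sigma, rng):
    """Return (true excesses, simulated observations x = Delta + epsilon)."""
    deltas = [d for d, count in sorted(frequency_table.items())
              for _ in range(count)]
    xs = [d + rng.gauss(0.0, sigma) for d in deltas]
    return deltas, xs


def mean(values):
    return sum(values) / len(values)


# Hypothetical "true distribution" of Delta: value -> number of varieties.
FREQUENCIES = {-2: 2000, -1: 5000, 0: 6000, 1: 5000, 2: 2000}
SIGMA = 1.0  # true standard error of each observation, as in the text's example

rng = random.Random(1937)  # fixed seed so that the run is reproducible
deltas, xs = simulate_trials(FREQUENCIES, SIGMA, rng)

true_second = mean([d * d for d in deltas])
naive_second = mean([x * x for x in xs])        # second moment of x itself
recovered_second = naive_second - SIGMA ** 2     # corrected for experimental error

print(f"second moment of Delta (true):      {true_second:.3f}")
print(f"second moment of x (uncorrected):   {naive_second:.3f}")
print(f"second moment of Delta (recovered): {recovered_second:.3f}")
```

The uncorrected moment of x overstates the spread of the true excesses by about σ², which is the reason the observed histogram of x is always wider than the distribution of Δ it is meant to describe.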

You may have noticed that among the hypotheses of Mrs. Tang there
is one that is doubtful. This is that the value of σ is the same in all
experiments. Actually, in dealing with the results of real experiments it
was found that this hypothesis may not be true. So Mrs. Tang checked,
again empirically, that her method was still applicable with σ varying
from one experiment to another within limits likely to occur in practice;
Figure 3 shows the fit obtained with varying σ.

FIGURE 3
Histogram  True distribution of Δ
a  Estimated distribution of Δ
b  Estimated distribution of x
c  Estimated distribution of Δ (variation of σ = 20 percent of mean σ)

Having thus obtained an indication that her method does lead to reasonable
results, Mrs. Tang applied it to the problem of estimating the distribution
of true sugar excess over the standard in a number of new varieties
tested in 1923 and 1924. The varieties were produced and tested by the
breeders, K. Buszczynski and Sons, Ltd., of Warsaw, who kindly supplied
the numerical data from their trials. Out of a considerable number of these
trials, Mrs. Tang selected 40 carried out in 1923 and an equal number
carried out in 1924. These were convenient as they had the same number
of replications, namely 5. In each of the two sets, 120 new varieties were
compared with the standard in a systematic arrangement like this:

S V₁ V₂ V₃ S V₁ V₂ V₃ S V₁ V₂ V₃ S V₁ V₂ V₃ S V₁ V₂ V₃ S    (11)

To work out these experiments, i.e. to calculate the estimates xᵢ of the
sugar excesses Δᵢ, and the corresponding standard errors, Mrs. Tang applied
the method of parabolic curves.⁵ Next she estimated the distribution of Δ,
the true sugar excess. Figure 4 gives the result referring to 1924. Here
the histogram represents the observed distribution of x and the continuous
curve the estimated distribution of Δ.

FIGURE 4
Estimated distributions of sugar excess, 1924
Histogram: Observed excesses of sugar content of 120 varieties over the standard
Broken line: Estimated excesses of sugar content of 120 varieties over the standard
Full line: True excesses of sugar content
Horizontal axis: Excess of sugar in per cent

5 See conference on randomized and systematic experiments, pp. 67-84.

Similar curves calculated by the breeder may give him diversified and 
important information which I shall classify under two headings. 

1. He may use such curves in analyzing his method of selecting parent
plants, step II, page 85. If the breeder has recorded how he selected his
plants a few years ago, he may usefully study what the distribution of Δ
would have been if the selection had been made differently, say by breeding
only half of the families that were actually taken. This would have
allowed him to make a stricter selection of parent plants, taking only the
very sweetest. If he ignores the new varieties that have been bred from
the parent plants assumed to have been discarded in such cases and
estimates the distribution of Δ for the remaining varieties, the breeder will
be able to see whether or not the taking of many parent plants and the
breeding of many new varieties does, in fact, represent a marked advantage.

2. When the breeder has the estimated distributions of Δ corresponding
both to his actual experiments and also to the stricter method of selection
at step II, he will be able to use the probabilities of errors of the second
kind to see what the final results of his efforts, including step V, would
have been. Let us illustrate this for the estimated distribution of Δ given
in Figure 4.

The breeder is naturally interested in the varieties, conventionally called 
“good” varieties, for which Δ > 0. Their proportion is represented by the
area of the curve to the right of the origin. The breeder will be interested 
to know what proportion of “good” varieties is likely to be detected as 
“good” by his field trials when they are arranged according to this or that 
plan. 

Take any positive value of Δ within the range of the curve in Figure 4,
calculate the corresponding value of ρ = Δ/σ and determine the probability
of an error of the second kind, corresponding to the value of ρ and to the
number of degrees of freedom considered for the trials. Subtract this
probability from unity and you will obtain the approximate value of the
proportion P(Δ) of good varieties with this value of Δ that will be detected
as “good” by the proposed trial.

Now calculate P(Δ′) for a number of successive values Δ′ of Δ. Next,
for these values Δ′, take the estimated ordinate p(Δ′) of the distribution of Δ
in the population of your new varieties (as for example, the full line curve of
Figure 4). This ordinate multiplied by δΔ is approximately equal to the
proportion of your varieties for which Δ falls between Δ′ and Δ′ + δΔ. Then
the proportion of the new varieties that (a) have their sugar excess V − S
between Δ′ and Δ′ + δΔ, and (b) will be detected as good varieties by the
field trials planned will be obtained by the multiplication

    p(Δ′) P(Δ′) δΔ.

Figure 5 was constructed in this way for the good varieties of Figure 4,
i.e., it was made up from that part of the estimated distribution of Δ in
Figure 4 lying to the right of the origin. The uppermost curve a of Figure 5 
is simply the full line curve lying to the right of the origin in Figure 4. 
The dimensions are reduced so that the area under this part of the curve 


FIGURE 5 
Distributions of true sugar excess 


a In population of varieties tested
b In population of varieties found significant at α = .05
c In population of varieties found significant at α = .01

is equal to unity: we are interested only in “good” varieties and in the
proportion likely to be detected as such. 

The lower curves labeled b and c represent plotted values of the products
p(Δ)P(Δ), where P(Δ) corresponds to α = 0.05 or 0.01 and to different
arrangements of the proposed experiments. It was assumed that all the
experiments had been arranged in randomized blocks and differed only in
the number of replications m, marked on each curve. The curves b
corresponding to α = 0.05 end at the point 0.05 on the axis of ordinates. The
other curves c corresponding to α = 0.01 have this ordinate equal to 0.01.

The area under each curve represents the proportion of “good” varieties
that will be recognized as such for a given α and a given number m of
replications. In addition the curves give the distribution of Δ for the
“good” varieties that will be detected. You will see that if the stricter
level of significance α = 0.01 is applied and if the number m of replications
is as small as 5, then the proportion of good varieties that will be detected
is very small. You will find its value, 16.6 percent, on the small table
attached to Figure 5, page 99. This number, 16.6 percent, is the area
under the curve for α = 0.01 and m = 5, divided by the area under the
curve marked a. On the other hand, if α = 0.05, then the same proportion
rises to 34.3 percent. If the number of replications is doubled, then the
corresponding figures will be 31.9 and 48.5 percent, respectively.
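
The calculation behind these areas can be imitated numerically. The sketch below is an illustration only and rests on assumptions that are not in the text: the density p(Δ) is a made-up normal curve, not the estimated curve of Figure 4; the plot standard deviation `SIGMA_PLOT` is hypothetical; and the probability of detection P(Δ) comes from a one-sided normal approximation, not from the tables of errors of the second kind used for Figure 5. All function names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def standard_error(sigma_plot, m):
    """Assumed standard error of the difference between a variety mean and
    the standard mean, each based on m replications."""
    return sigma_plot * (2.0 / m) ** 0.5

def detection_proportion(density, se, alpha, upper=1.5, steps=3000):
    """Area under p(Δ)P(Δ) for Δ > 0 divided by the area under p(Δ) for Δ > 0."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    width = upper / steps
    detected = 0.0
    good = 0.0
    for k in range(steps):
        delta = (k + 0.5) * width                          # midpoint of a strip of width δΔ
        power = STANDARD_NORMAL.cdf(delta / se - z_alpha)  # 1 - P(error of second kind), approximated
        detected += density(delta) * power * width         # p(Δ) P(Δ) δΔ
        good += density(delta) * width                     # p(Δ) δΔ
    return detected / good

# Hypothetical population of true excesses (per cent of sugar) and plot error.
hypothetical_density = NormalDist(mu=-0.1, sigma=0.3).pdf
SIGMA_PLOT = 0.5

for alpha in (0.05, 0.01):
    for m in (5, 10):
        share = detection_proportion(
            hypothetical_density, standard_error(SIGMA_PLOT, m), alpha)
        print(f"alpha={alpha}, m={m}: share of good varieties detected = {share:.3f}")
```

The printed shares will not reproduce the table attached to Figure 5, since the inputs are invented. Only the qualitative behavior carries over: a stricter level of significance lowers the share detected, and more replications raise it.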

Apart from the proportion of good varieties likely to be detected, the
breeder may be interested in the proportion of those for which the value
of Δ is not merely positive but exceeds some arbitrary limit, say 0.2 percent
of sugar. Such varieties may conventionally be termed the “best.” There
is no difficulty in calculating the proportions of the “best” varieties whose
superiority over the standard would be detected by the trials. We have
only to use the areas of all the curves to the right of the line Δ = 0.2. The
corresponding figures are given in the two “best” columns of the table
attached to Figure 5. For instance, in the table under “Probability of
detecting a best variety,” at α = 0.05 and m = 8, we see 0.696. This means
that the area to the right of 0.2 percent under the curve for α = 0.05 and
m = 8 is 0.696 of the area to the right of 0.2 percent under the curve
marked a.

Figure 5, the table and the method of construction represent the main
result of the work of Mrs. Y. Tang. The breeder who now starts 500 new 
varieties each year and replicates them only 5 times in his trials may use 
her results to construct curves similar to those in Figures 4 and 5, and 
may thus compare the probable results of his work for the cases in which 
he started, not with 500 families, but perhaps with 400, 300, 200 and a cor- 
responding increase in the number of replications. Having these results 
before his eyes he will be able to take into account various economic
factors and choose the most economical relation between the number of
replications and that of the new families started.

I might conclude here, but it seems advisable to warn the reader that 
the actual process of seed breeding is a little more complex than that 
presented above. In fact, it is extremely difficult to include in formulas an 
exact process of any more or less complicated practical work. This is also 
the position in the present case. In order to give an idea of what I have 
in mind, I may remind you of one thing that I have already mentioned— 
new varieties are tested for more than one vegetative period and in more 
than one spot. It follows that the method built by Mrs. Tang refers to 
a simplified case. But also it is obvious that she has contributed to our 
technique by showing how to calculate the probable results of only one 
series of field trials when no such method existed before. And even 
though this is not all that is needed, it is a great deal because the most 
difficult part of any problem consists in noticing that there is a problem 
and in advancing some sort of solution. There are usually a lot of people 
able to introduce the necessary corrections and extensions. 

QUESTION BY DR. SARLE, POINTING TO FIGURE 5: What basis do we have
for figuring the possibility of including some “false good” varieties in this 
area? Will all poor ones be eliminated by this process, or is there a 
chance of getting some of the poor ones? 

Answer: Figure 5 refers only to those varieties that are really “good.” 
The control of “false good” varieties is kept by choosing a proper level of 
significance. If you fix α = 0.05, then the chance that the best out of the
“false good” varieties (those with Δ = 0) will be passed as good is 0.05.
On the other hand, the areas under sections of the curves in Figure 5 give 
the proportions of the varieties that are really “good.” 

QUESTION BY DR. SARLE: Your method automatically does that?

Answer: In principle, yes: but we must remember that the method gives 
only an estimate which is always liable to error. 

QUESTION BY DR. SARLE: How does it know which one to pick out?

Answer: It doesn’t. It would be a great thing if it did. All that it can 
do is to estimate proportions. If you toss a fair penny, you can never tell 
exactly when it will fall heads. On the other hand, you can safely say that 
in the long run the proportion of heads will be about one-half. Similarly, 
no statistical method is able to indicate which of the varieties with positive 
x is really “good” and which is “false.” On the other hand it is possible
to estimate the proportion of those that are really “good” and also the 
proportion of their number which will be detected as “good.” 

QUESTION BY DR. SALMON: This means that with five replications you
actually identify only a relatively small percentage of the total number
of good varieties.

Answer: Yes, a very small percentage. But we must remember that
the accuracy of experiments varies a great deal from year to year, owing 
to weather conditions. As a matter of fact, in the year 1923 which was 
also studied by Mrs. Tang, the proportion was found to be much greater 
than that indicated here. 














CHAPTER III 


Some Statistical Problems in Social and Economic Research 


(This chapter is dedicated to the memory of Dr. Kazimierz Korniłowicz, the late
Director of the Institute for Social Problems, Warsaw, Poland.) 


Part 1. Sampling Human Populations. General Theory. 


The present section is based on a conference held in the Auditorium of 
the United States Department of Agriculture, April 8, 1937, Dr. Frank M. 
Weida presiding. At this conference, I summarized my general ideas on 
sampling human populations and gave some theoretical results which I had 
obtained in connection with a sampling survey of Polish labor conducted 
by the Institute for Social Problems in Warsaw. The methodology
developed was originally published in Polish.1 Later on, the main theoretical
results were incorporated into a critical survey of sampling methods,
published in English.2 The numerical results obtained in the sampling survey
by the Institute for Social Problems were published by Jan Piekałkiewicz.3

The subject of the conference of April 8, 1937, was selected as a result 
of numerous letters received from prospective members of the audience. 
All these letters visualized the following general situation: a certain amount 
of money is available for a survey and the problem is to determine the 
sampling procedure which will make the best use of these funds. Such 
differences as were present in the several particular problems described in 
the correspondence referred to specific circumstances of sampling and to 
special characteristics of the population studied. The purpose of the con- 
ference was to develop some general ideas from which answers to a number 
of particular questions could be derived. 

One of my correspondents had in mind a population of 300 cities. In 
order to study this population he intended to select a sample of 25 cities 


1 J. Neyman: An Outline of the Theory and Practice of the Representative Method
Applied in Social Research. Institute for Social Problems, Warsaw, 1933, 123 pp. 
(Polish). 

2 J. Neyman: “On the two different aspects of the representative method.” Jr. Roy. 
Stat. Soc., Vol. 97 (1934), pp. 558-625. 

3 Jan Piekałkiewicz: Report on the Study of the Structure of Polish Labor. Institute
for Social Problems, Warsaw, 1934, 238 pp.

to be covered by a 100 percent enumeration. His question was how best
to select 25 cities from the 300 so that he could draw conclusions regarding 
the inhabitants of all the 300 cities under investigation. 

I am not going to answer this question. Instead I am going to advise 
as strongly as I can that the proposed method of sampling be dropped 
altogether. This method is most dangerous and is practically certain to 
lead to deplorable results. Of course, I do not mean by this that a
successful inquiry by means of sampling is impossible. On the contrary, it is my
opinion that the sampling method is useful and can provide very accurate
results. What I emphatically protest against is the selection of any 25 cities
for a complete census (100 percent enumeration) with the consequent total
omission of the remaining 275 cities.

Broadly speaking, there are two essentially different methods of sampling 
used in social work. One is called the method of purposive selection, the 
other that of random sampling. This subdivision is a little artificial, but 
owing to the fact that it was used in a special report 4 on the method,
presented to the International Statistical Institute, it is generally accepted.

The method which consists in selecting 25 out of 300 cities and in limiting
the investigation to these 25 cities falls under the heading of “purposive
selection.” The mere question of how one should best select these cities
suggests that the selection was not meant to be random, at least not
entirely random. Usually it is suggested that the sample of cities should
be selected so that the averages of certain characters, called controls,
calculated for the sample and for the universe should be in as close agreement
as possible. It is this circumstance which justifies the term “purposive
selection.” But it is not the limitation of the randomness of sampling
which makes the method dangerous. In fact, if the question concerned 
only random sampling, I could easily answer it by saying that the best 
way of selecting the 25 cities is to draw them at random. 

The trouble with the method lies in the fact that if we try to select 
things (cities, districts, etc.) “purposely,” then both the total number of
units from which selection is to be made, and even more inevitably, the 
number selected must necessarily be small, and therefore the units them- 
selves must be rather large. In the present case we have 300 units out of 
which only 25 are to be selected. Each unit of selection is a city inhabited 
probably by tens of thousands of people, possibly more, and the differences 
between the units may be enormous. This is a rough description of the 
method called “purposive selection.” 

The nomenclature “purposive selection” and “random sampling” is not 
very felicitous, as I have already indicated. It does not describe the 


4 A. L. Bowley: “Measurement of the precision attained in sampling.” Bull. Inst. Int.
Stat., Vol. 22 (1925), 1er livre, pp. 1-62, supplement.

essential difference between the two methods when they are applied in
practice. The first method, that of “purposive selection,” consists in 
dividing the whole population into a comparatively few (say 300) large 
groups (e.g. cities) or units, of which some 20 or 30 are selected “purposely.”
The essential feature of the other method is that the same population is
divided into a much larger number (say 100,000 or more) of small groups
(e.g. families, inhabitants of single houses, blocks, etc.) from which a
sample of around 1000 or more are selected, either entirely at random or 
at random with some restrictions. 

The first method is hopeless, the other extremely useful. If anyone 
would like to see theoretical reasons for this opinion, he will find them in 
my paper published in the Journal of the Royal Statistical Society, already 
quoted. In this conference I will give an intuitive illustration of the ideas 
expressed there. Suppose we have a hundred dollars that we decide to use 
for gambling in a fair game. If we divide the whole sum into, say, five 
parts of $20 each and bet only five times, it is impossible to make a reliable 
prediction of what the result may be. We may lose all our money, or 
equally easily, we may double it. On the other hand, if we make a hundred 
bets at $1 each, then we can make some predictions with fair hope of
success. The result of the game still remains uncertain, but it would be 
rather surprising if the sum won or lost exceeded $20. The accuracy of 
the prediction would be still greater if, instead of making a hundred bets 
at $1, we would make a thousand bets of a dime. 
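
These propositions are easy to check by direct calculation. The sketch below assumes, as an illustration, that each bet is an even-money bet won or lost with probability one-half and that all bets are independent; the function names are hypothetical. Amounts are kept in whole cents so that the comparisons are exact.

```python
import math

def net_sd(n_bets, stake):
    """Standard deviation of the net result of n_bets independent even-money
    bets of equal stake in a fair game (same unit as stake)."""
    return stake * math.sqrt(n_bets)

def prob_net_exceeds(n_bets, stake, limit):
    """Exact probability that the sum won or lost exceeds limit."""
    favourable = 0
    for wins in range(n_bets + 1):
        net = stake * (2 * wins - n_bets)   # winnings minus losses
        if abs(net) > limit:
            favourable += math.comb(n_bets, wins)
    return favourable / 2 ** n_bets

# Five bets of $20, a hundred bets of $1, a thousand bets of a dime (in cents).
for n_bets, stake in ((5, 2000), (100, 100), (1000, 10)):
    sd_dollars = net_sd(n_bets, stake) / 100
    p = prob_net_exceeds(n_bets, stake, 2000)
    print(f"{n_bets:>4} bets: sd = ${sd_dollars:.2f}, P(sum won or lost > $20) = {p:.4f}")
```

With five bets the sum won or lost exceeds $20 in 12 of the 32 equally likely outcomes; with a hundred bets this happens in fewer than one game in twenty, and with a thousand bets practically never.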

These are perfectly intuitive propositions and you will notice that they 
have a definite bearing on the problem of sampling human populations. 
The advice against selecting 25 cities out of a total of 300 is not based on 
theoretical considerations alone: some practical experience is available to 
show what the result of an inquiry might be if this method is applied. 

In 1926 or 1927 two Italian statisticians, Gini and Galvani,5 had to
solve a problem of a kind that is exactly similar to the one contemplated 
here. They had to deal with the data of a general census. The data were 
worked out, a new census was approaching, and the room had to be cleared 
for the new data. The old data were to be destroyed, but the statistical 
office wanted to keep a representative sample so as to have material for 
future studies, as yet unanticipated. Gini and Galvani were responsible for 
the method of obtaining a sample which would represent the situation in 
the whole of Italy. What they did is a good example of how not to 
sample human populations. 

The two authors carefully considered the problem, took into account 


5 Corrado Gini and Luigi Galvani: “Di una applicazione del metodo rappresentativo
all’ultimo censimento italiano della popolazione.” Annali di Statistica, Serie VI, Vol. 4
(1929), pp. 1-107.

the Report of the International Statistical Institute and decided to apply
the method of purposive selection. The whole of Italy was divided into
214 administrative districts called circondari and out of these 29 circondari
were selected to form the sample. Some of the circondari are large districts 
with more than a million inhabitants. It is interesting to note that the 
ratio of the Italian sample to the universe sampled, 29:214, is substantially 
larger than the sampling ratio contemplated by my correspondent, 25:300. 

Various averages for each circondario had been calculated previously. 
Gini and Galvani selected 12 characters of the circondari to serve as con- 
trols and subdivided these into essential and secondary controls. They 
tried to select the 29 circondari so that the means of the essential controls 
calculated from the sample would be practically identical with those for 
the whole population. They also tried to reach a reasonable agreement 
between the population and the sample means of the secondary controls. 
If you will look at the figures, you will find that the agreement of the mean 
of each control in the sample with the mean of the same control in the 
population is very good. 

From the paper by Gini and Galvani, it is uncertain whether or not the 
old Italian census data were destroyed and the sample was left for future 
reference. However, the two authors decided to check the goodness of the 
sample by comparing its various characteristics with those known for the 
whole population of Italy. The results of this comparison are described 
by Gini and Galvani and should be kept in mind as an argument against 
the use of the purposive selection method. Gini and Galvani found that 
the distributions of various characteristics of the individuals, the correla- 
tions, and, in fact, all statistics other than the average values of the controls 
showed a violent contrast between the sample and the whole population. 
Figure 1 reproduces a diagram taken from page 95 of the paper by Gini 
and Galvani, which illustrates the situation. You will see that the distri- 
bution observed in the sample bears little resemblance to that of the whole 
population. 

Having discovered that their sample of 29 circondari is not at all repre- 
sentative of the whole population, the Italian statisticians expressed the 
opinion that, generally, it is impossible to obtain a sample that reproduces 
the population sampled and all its properties. Strictly speaking, they are 
correct. In 1926 there was in Italy but one Marchese Marconi, the great 
inventor in the field of wireless telegraphy. Whatever the method of 
sampling, the proportion of Marconis in the sample can not be equal to 
that in the population. But we do not take samples to establish such
proportions; and both theory and experience indicate that, whenever we have
in mind a truly statistical problem of estimating means of any size, of
regressions, etc., a properly drawn sample is, for all practical purposes,
sufficient.

[Figure 1. Diagram reproduced from Gini and Galvani, Annali di Statistica, Serie VI, Vol. 4, 1929.]

Now let us consider what is to be done to get a reliable sample. Here 
we must rely on the theory of probability and work with great numbers. 
“Great numbers” does not mean great numbers of people included in the 
sample, but great numbers of random selections to form a sample, or great 
numbers of units that are drawn separately. The sample of 25 cities or 
the sample of 29 circondari contain a great number of people, but from the 
point of view of sampling theory they are both small samples because they 
are composed of 25 or 29 units, respectively. For a sample to be reliable 
the number of units must be large. 

Thus, instead of dividing your population into 300 parts, each representing 
a particular city, you need to carry the subdivision much farther. Probably 
it would be best to divide the whole population of 300 cities into small 
groups inhabiting single houses or blocks. All these groups, which I shall 
call units of sampling, or simply units, must be listed, and the necessity 
of listing usually imposes a limit to the tendency of having the units very 
small. 

When the population sampled is represented by the mass of records
obtained by a general census, then the smallest unit you can choose
conveniently is the smallest division for which there is a separate folder in
your data, or for which you have a separate punch card. This division 
may represent a block, a household, etc., the smaller the better. Ordinarily, 
since such divisions are small, there is a great number of them in the 
population studied. Therefore no great difficulty occurs in sampling from 
existing records. The situation is much more difficult if you are to sample 
people, not records already collected. In this case you have to send 
enumerators into the field and give them addresses at which to call. In 
order to insure randomness of the sample, the addresses must be selected 
truly at random and this requires a previously compiled list of all addresses 
forming the population. Frequently such a list is unavailable and this 
causes considerable difficulty. However, this difficulty may be overcome 
in part by using a map that divides the area under investigation into a 
large number of small sections and by considering each section as a unit 
of sampling. Whatever the selected unit, it is a relatively simple matter 
to produce a random sample of any preassigned size once you have a com- 
plete list of the units forming the population. 

Several questions addressed to me were concerned with what proportion 
the size of the sample should bear to the size of the population. This pro- 
portion does affect the precision but in a much milder way than the 
number of units selected to form the sample. Thus, a sample of 10,000 units 
(blocks, inhabitants of separate houses, etc.) will be very accurate almost 
irrespective of whether it forms 10 percent of the population studied or 
one percent or one-tenth of a percent. 
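
A small calculation makes this concrete. The sketch assumes unrestricted random sampling without replacement and uses the usual textbook formula for the standard error of a sample mean, the unit standard deviation times the square root of (1 − n/N)/n; the unit standard deviation of 1 is hypothetical and the function name is illustrative.

```python
import math

def standard_error_of_mean(unit_sd, n, population_size):
    """Standard error of the mean of n units drawn at random without
    replacement from population_size units with standard deviation unit_sd."""
    return unit_sd * math.sqrt((1.0 - n / population_size) / n)

UNIT_SD = 1.0  # hypothetical variability between sampling units

for n in (1_000, 10_000):
    for fraction in (0.10, 0.01, 0.001):
        population_size = round(n / fraction)
        se = standard_error_of_mean(UNIT_SD, n, population_size)
        print(f"n={n:>6}, sampling fraction={fraction:<5}: standard error = {se:.5f}")
```

For a fixed number of units the standard error changes by only about five percent as the sampling fraction falls from 10 percent to one-tenth of a percent, whereas cutting the number of units from 10,000 to 1,000 at the same fraction roughly triples it.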

The process of random sampling may be of various forms which are not 
equivalent from the point of view of the accuracy of the results. The first 
attempt at a serious study of the relation between the method of sampling 
and the accuracy of results was made by Bowley and is described in his 
report to the International Statistical Institute already mentioned. The 
main results of his study are as follows. 

Random sampling is called unrestricted if at each drawing each of the 
elements forming the population studied has the same chance of being 
drawn. To illustrate this idea I shall point out that, if the population is 
formed by the inhabitants of 300 cities and if the unit of sampling is repre- 
sented by a block, then unrestricted sampling combined with bad luck can 
produce a sample composed of blocks from just one city with the complete 
omission of other cities. However this is extremely unlikely. 

More accurate results could be obtained by what Bowley calls stratified 
sampling and what I call stratified proportional sampling. This consists 
in a twofold subdivision of the population studied. First, we divide it 
into a convenient number of larger parts, called strata. For example, a
stratum may be a city or a large section of a city. Next, each stratum is
divided into units of sampling. If you have decided to work with a sample 
of one-twelfth of the population, then from each stratum separately you 
select at random one-twelfth of its units. This makes it impossible for the 
sample to be devoid of units representative of larger sections of the popu- 
lation studied. 
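
The procedure can be sketched in a few lines. The strata, their sizes and the unit labels below are hypothetical, and the function name is illustrative; the only point is that the same fraction is drawn, at random and without replacement, from the list of units of every stratum.

```python
import random

def proportional_stratified_sample(strata, fraction, rng):
    """Draw the same fraction of units, at random without replacement,
    from every stratum. strata maps a stratum name to its list of units."""
    sample = {}
    for name, units in strata.items():
        size = max(1, round(len(units) * fraction))   # at least one unit per stratum
        sample[name] = rng.sample(units, size)
    return sample

# Hypothetical lists of sampling units (block numbers) for three strata.
strata = {
    "city A": list(range(240)),
    "city B": list(range(120)),
    "city C": list(range(60)),
}
sample = proportional_stratified_sample(strata, 1 / 12, random.Random(1937))
for name, units in sample.items():
    print(name, len(units), "units drawn")
```

Every stratum contributes units in proportion to its size, so no large section of the population can be left out by bad luck.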

When we divide the population into strata, we should remember that 
the more homogeneous the single strata, the better will be the effect of 
stratification. In practically every city certain sections are easy to distin- 
guish as those inhabited by the well-to-do, those of poorer people, shopping 
districts and industrial areas. In order to achieve better accuracy, each 
of these sections should be treated as a separate stratum. 

However, homogeneity of a stratum does not necessarily mean equality 
or similarity of all people inhabiting this stratum. In fact, homogeneity 
of a stratum or of a population means a comparative similarity of the units 
of sampling, rather than of the individuals forming the units. If the popu- 
lation of a town is composed of representatives of ten different races, each 
in the same proportion, then we would say, probably, that this population 
is very heterogeneous from the racial point of view. However, from the 
point of view of sampling, this population would be ideally homogeneous 
if it happened that the racial composition of each of its sampling units is 
exactly the same as that of the whole population. Thus one sees that the 
internal heterogeneity of sampling units goes with an external homogeneity 
of these units within the population. This is a general rule. 

From this it follows that the choice of sampling units of a fixed size is 
not indifferent from the point of view of the accuracy of an investigation 
by sample. Dr. Frederick F. Stephan tells me that an investigation has 
shown the existence of a greater similarity between the inhabitants of two 
sides of one street than between those of opposite sides of the same block. 
Hence, if one contemplates dividing the population alternatively into units 
of sampling composed of the inhabitants of the two sides of sections of 
single streets or of the two sides of single blocks, the latter method would 
give more homogeneous units and therefore greater accuracy of sampling. 

Frequently, the gain in accuracy resulting from stratification is consider- 
able, but it is possible to go further than Bowley advised. A cursory glance 
at the situation suggests that the rule of selecting randomly the same propor- 
tion of units out of each stratum may not be the best procedure. You 
can not expect that all the strata will be equally homogeneous internally. 
To make the situation clear, suppose that one of the strata, A, is ideally 
homogeneous, while another, B, is fairly heterogeneous. Then, in order 
to know all about the stratum A, it is sufficient to take a sample of only one 
unit. On the other hand, an accurate estimate of the properties of B would
require a sample of considerable size. If we decide to sample both A and B
in proportion to their sizes (= the number of elements of sampling they 
contain), then we shall “oversample” A and “undersample” B. This intui- 
tive reasoning is fully supported by the theory I developed in my article 
of 1934, already quoted. 
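
The effect can be seen in a small numerical sketch. It assumes the standard formula for the variance of a stratified estimate of a mean when units are drawn at random without replacement within each stratum: the sum over strata of M_i (M_i − m_i) σ_i² / m_i, divided by the square of the total number of units, with σ_i² computed with divisor M_i − 1. The stratum sizes, the variances and the function name are hypothetical.

```python
def stratified_variance(stratum_sizes, sample_sizes, within_variances):
    """Variance of the stratified estimate of the grand average for given
    stratum sizes M_i, sample sizes m_i and within-stratum variances."""
    total_units = sum(stratum_sizes)
    total = 0.0
    for M_i, m_i, var_i in zip(stratum_sizes, sample_sizes, within_variances):
        total += M_i * (M_i - m_i) / m_i * var_i
    return total / total_units ** 2

sizes = [1000, 1000]       # strata A and B, equally large
variances = [0.0, 4.0]     # A ideally homogeneous, B fairly heterogeneous

proportional = stratified_variance(sizes, [100, 100], variances)
shifted = stratified_variance(sizes, [1, 199], variances)
print("proportional sampling, 100 + 100 units:", proportional)
print("one unit from A, 199 units from B:     ", shifted)
```

With the same total of 200 units, moving almost the whole sample into the heterogeneous stratum B cuts the variance by more than half, which is the sense in which proportional sampling oversamples A and undersamples B.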

After this somewhat general discussion, let me enter into a few details. 
Consider a population, say Π, divided into a certain number s of strata,
say π_1, π_2, ..., π_i, ..., π_s. Further, assume that the ith stratum contains
M_i units of sampling, numbered from 1 to M_i, and let

    u_{i1}, u_{i2}, \ldots, u_{ij}, \ldots, u_{iM_i}    (1)

be the values of a certain numerical characteristic U of these units. Our
problem is to estimate the grand average U.. referring to the whole
population Π. In other words, if M_0 stands for the total number of units of sampling
in the whole population,

    M_0 = \sum_{i=1}^{s} M_i,    (2)

then

    U_{\cdot\cdot} = \frac{1}{M_0} \sum_{i=1}^{s} \sum_{j=1}^{M_i} u_{ij} = \frac{1}{M_0} \sum_{i=1}^{s} M_i u_{i\cdot},    (3)

where u_i. denotes the average value of the characteristic U relating to the
ith stratum.

While the general situation is being considered, it is convenient to have in 
mind one or two specific examples. Thus, the purpose of a certain sampling 
survey may be to establish the total number of unemployed in a given area A 
of the country. In this case the area A may be divided into s sections repre- 
senting strata. Each stratum may be sub-divided into a number of convenient 
sampling units, e.g., blocks or merely squares on the map. The symbol u_ij
will denote the number of unemployed inhabiting the jth block of the ith
stratum. Once we have estimated the grand average U.. of the number of
unemployed per block, it is a simple matter, if we know the number of blocks, 
to estimate the total number of unemployed in the whole area A. 

Alternatively, the purpose of the sampling survey may be to estimate 
the average expenditure on housing (or any other item) per family of 
unemployed inhabiting the area A. This problem is a little more compli- 
cated because it splits into two: (1) to estimate the total expenditure on 
housing of all the unemployed and (2) to estimate the total number of 
families of unemployed. Thus, actually, we have a combination of two 
related problems but I intend to discuss in detail only a single problem. 
In this case it would be the problem of estimating the average expenditure
on housing of unemployed families per one unit of sampling. Here u_ij
would mean the combined expenditure on housing of families of unemployed
that inhabit the jth block of the ith stratum. If the total number of 
unemployed is known or if it is estimated by the same inquiry, then the 
knowledge of the average per block of expenditures on housing will provide 
the requisite average per family. 

Returning to the general theoretical case, denote by m_i the number of units
which we intend to select from the ith stratum for i = 1, 2, ..., s, to form a
sample on which to base an estimate of U... Denote by X_ij the value of the
characteristic U to be observed in the jth unit of sampling selected from the
ith stratum. Before the sample is actually drawn, the exact values of X_i1,
X_i2, ..., X_im_i are unknown and any one of the numbers (1) may appear as
the value of X_i1, any one of these numbers may appear as the value of X_i2,
etc. In fact, the symbols X_i1, X_i2, ..., X_im_i may represent any one of the
many combinations of m_i out of the M_i numbers (1). Before the sample is
drawn, the X_ij’s are random variables.

Let

    X_{i\cdot} = \frac{1}{m_i} \sum_{j=1}^{m_i} X_{ij}.    (4)

Then the best linear unbiased estimate 6 of the grand average U.. is

    X_{\cdot\cdot} = \frac{1}{M_0} \sum_{i=1}^{s} M_i X_{i\cdot}.    (5)
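
As a check on the meaning of the estimate (5), the sketch below computes it for every possible sample from a tiny hypothetical population of two strata and confirms that the estimates average out to the grand average, which is what "unbiased" means here. The data and the function name are invented for illustration.

```python
from itertools import combinations, product

def stratified_estimate(stratum_sizes, samples):
    """Weighted mean of the stratum sample means, with the stratum sizes
    M_i as weights: (1 / M_0) * sum of M_i * X_i. ."""
    total_units = sum(stratum_sizes)
    weighted = sum(M_i * sum(x) / len(x) for M_i, x in zip(stratum_sizes, samples))
    return weighted / total_units

# Hypothetical population: values of the characteristic in two strata.
strata = [[2, 4, 6], [10, 20, 30, 40]]
sizes = [len(stratum) for stratum in strata]
grand_average = sum(sum(stratum) for stratum in strata) / sum(sizes)

# Every possible sample of two units from each stratum.
possible = [list(combinations(stratum, 2)) for stratum in strata]
estimates = [stratified_estimate(sizes, choice) for choice in product(*possible)]
mean_of_estimates = sum(estimates) / len(estimates)

print("grand average:          ", grand_average)
print("average of all estimates:", mean_of_estimates)
```

Individual estimates differ from sample to sample; it is their spread around the grand average that the variance discussed next is meant to measure.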

It is customary to measure the precision of this estimate by the value of
its variance, say σ². The smaller the variance is, the better the precision.
The theoretical problem before us is to determine the best way of using
the funds available for the survey in order to minimize the variance σ².
The first solution of this problem is contained in my article already quoted. 
However, this solution applies to the special case where the cost of sampling 
or, more precisely, the average cost per unit of sampling is the same in all 
the s strata. This condition is frequently satisfied. In some cases, how- 
ever, when certain of the strata are urban and others are rural, the average 
cost of sampling a unit may vary considerably from stratum to stratum. 
Since the method of determining the stratification which will be optimum
in the use of sampling funds is exactly the same whether the average cost
is constant or not, we shall consider the more general case.

Assume then that the total expenditure on the survey is fixed and is

6 The term “best linear unbiased estimate” is very intuitive and familiar to many 
statisticians. Details of the definition and some theory may be found in the excellent 
little book by F. N. David: Probability Theory for Statistical Methods. Cambridge 
University Press, 1949, 230 pp. 
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equal to C dollars. Further let c; represent the average cost per unit of 
sampling in the ith stratum. Then the numbers m, 21= 1, 2, +++, s, of 
units to be selected from the whole population II must satisfy the condition 


mC, + M2Cz ++-+-+ mc; +---+ mecs = C. (6) 


Our problem is to determine the numbers $m_i$ so as to satisfy (6) and
so as to minimize the variance $\sigma^2$. The value of $\sigma^2$ corresponding to any
fixed system of the $m_i$ is obtained from the formula

$$\sigma^2 = \frac{1}{M_0^2} \sum_{i=1}^{s} M_i^2 \, \frac{M_i - m_i}{M_i m_i} \, \sigma_i^2, \tag{7}$$

where

$$\sigma_i^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (u_{ij} - u_{i.})^2 \tag{8}$$

represents the internal variability within the $i$th stratum. Formula (7)
must be familiar⁷ to many of you so I shall not bother you with its proof.
However, formula (7) is not convenient for our purposes since it does not
immediately bring out the effect of choosing this or that system of values
of the $m_i$, which determines the stratification of the sample. For this reason
we shall rewrite formula (7) in an alternative form. This is obtained by
using the familiar identity applicable to any numbers $a_1, a_2, \ldots, a_s$ without
any restriction and to any “weights” $w_1, w_2, \ldots, w_s$, provided the sum of
the weights is different from zero. The identity in question is

$$\sum_{i=1}^{s} w_i a_i^2 = a_{.}^2 \sum_{i=1}^{s} w_i + \sum_{i=1}^{s} w_i (a_i - a_{.})^2, \tag{9}$$
where

$$a_{.} = \frac{\sum_{i=1}^{s} w_i a_i}{\sum_{i=1}^{s} w_i} \tag{10}$$

is the weighted mean of the $a$'s.
Returning to formula (7), we split the right hand side into two parts
of which only the first depends on the $m_i$,

$$\sigma^2 = \frac{1}{M_0^2} \left\{ \sum_{i=1}^{s} \frac{M_i^2 \sigma_i^2}{m_i} - \sum_{i=1}^{s} M_i \sigma_i^2 \right\}. \tag{11}$$


7 This formula is deduced in detail in the book by F. N. David, already quoted.

Now consider the first sum within the curved brackets. Multiply and
divide the $i$th term of this sum by the product $m_i c_i$ and apply formula (9),
letting $m_i c_i$ play the role of the weights $w_i$ and the quotients

$$\frac{M_i \sigma_i}{m_i \sqrt{c_i}}$$

the role of the arbitrary numbers $a_i$. Remembering condition (6) we have

$$\sum_{i=1}^{s} \frac{M_i^2 \sigma_i^2}{m_i} = \sum_{i=1}^{s} m_i c_i \left( \frac{M_i \sigma_i}{m_i \sqrt{c_i}} \right)^2 = C A^2 + \sum_{i=1}^{s} m_i c_i \left( \frac{M_i \sigma_i}{m_i \sqrt{c_i}} - A \right)^2, \tag{13}$$

where

$$A = \frac{1}{C} \sum_{i=1}^{s} M_i \sigma_i \sqrt{c_i} \tag{14}$$

is the weighted mean of the quotients

$$\frac{M_i \sigma_i}{m_i \sqrt{c_i}}$$

and appears to be a fixed number, independent of the $m_i$. Substituting (13)
into (11), we obtain the desired formula for the variance $\sigma^2$,

$$\sigma^2 = \frac{1}{M_0^2} \left\{ C A^2 + \sum_{i=1}^{s} m_i c_i \left( \frac{M_i \sigma_i}{m_i \sqrt{c_i}} - A \right)^2 - \sum_{i=1}^{s} M_i \sigma_i^2 \right\}. \tag{15}$$

In virtue of (14) only the second term in the curved brackets of (15)
depends on the numbers $m_i$ which determine the stratification of the sample.
This term is a weighted sum of squares of differences between the quotients

$$\frac{M_i \sigma_i}{m_i \sqrt{c_i}}$$

and their weighted mean $A$. It follows that, in order to minimize $\sigma^2$, it is
both necessary and sufficient to ascribe values, say $m_i^*$, to the $m_i$ such that
for each $i$

$$\frac{M_i \sigma_i}{m_i^* \sqrt{c_i}} = \frac{1}{C} \sum_{j=1}^{s} M_j \sigma_j \sqrt{c_j} = A. \tag{16}$$


This implies that the optimum stratification of the sample is determined by

$$m_i^* = \frac{C M_i \sigma_i}{\sqrt{c_i} \, \sum_{j=1}^{s} M_j \sigma_j \sqrt{c_j}}, \qquad i = 1, 2, \ldots, s. \tag{17}$$


Note added in proof: A result equivalent to formula (17) is contained in the book, 
Some Theory of Sampling, by William Edwards Deming (John Wiley and Sons, New 
York, 1950), which appeared between the time the above lines were written and the 
reading of the proofs. However, Deming’s method of obtaining the result is different 
from the one used here. 


It is seen that the numbers $m_i^*$ so determined automatically satisfy
condition (6) which states that the total cost of sampling must be $C$.
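
To see formula (17) at work, consider the sketch below. The function `optimum_allocation` and all the figures (three strata with their $M_i$, $\sigma_i$, $c_i$, and the budget $C$) are assumptions made for the example, not data from any survey. The computed $m_i^*$ are left fractional; the question of rounding is set aside here.

```python
import math


def optimum_allocation(M, sigma, c, C):
    """Formula (17): m_i* = C M_i sigma_i / (sqrt(c_i) * sum_j M_j sigma_j sqrt(c_j))."""
    denom = sum(Mj * sj * math.sqrt(cj) for Mj, sj, cj in zip(M, sigma, c))
    return [C * Mi * si / (math.sqrt(ci) * denom) for Mi, si, ci in zip(M, sigma, c)]


# Hypothetical strata: sizes, internal standard deviations, unit costs.
M = [1000, 4000, 5000]
sigma = [12.0, 5.0, 2.0]
c = [1.0, 4.0, 9.0]
C = 2000.0  # hypothetical total budget

m_star = optimum_allocation(M, sigma, c, C)
total_cost = sum(mi * ci for mi, ci in zip(m_star, c))  # should reproduce C, condition (6)

# Under the optimum, all quotients M_i sigma_i / (m_i sqrt(c_i)) coincide, as in (16).
quotients = [Mi * si / (mi * math.sqrt(ci)) for Mi, si, mi, ci in zip(M, sigma, m_star, c)]
print(m_star, total_cost, quotients)
```

In this invented example the first stratum, being the most variable and the cheapest to sample, receives by far the largest sampling fraction.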

In practice it will be impossible to satisfy formula (17) exactly because
the numbers $m_i^*$ must be integers while ordinarily the right hand side
of (17) is an irrational number. However, this difficulty is trivial and
the optimum stratification of the sample may be taken as the system of
$s$ integer numbers closest to the values of the right hand side of (17) for
$i = 1, 2, \ldots, s$, or just exceeding them. With this stratification, we may
ignore the middle term in (15) and write, say,

$$\sigma_0^2 = \frac{1}{M_0^2} \left[ \frac{1}{C} \left( \sum_{i=1}^{s} M_i \sigma_i \sqrt{c_i} \right)^2 - \sum_{i=1}^{s} M_i \sigma_i^2 \right] \tag{18}$$

(with a very good approximation) as the minimum variance of the estimate
of $U_{..}$ attainable with the optimum stratification of the sample.
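
How good the approximation is after rounding can be checked numerically. The sketch below evaluates the variance (7) for the exact and for the rounded allocation and compares both with the minimum (18). The strata, costs and budget are the same hypothetical figures as one might invent for any illustration; the function names are mine.

```python
import math

# Hypothetical strata: sizes, internal standard deviations, unit costs, budget.
M = [1000, 4000, 5000]
sigma = [12.0, 5.0, 2.0]
c = [1.0, 4.0, 9.0]
C = 2000.0
M0 = sum(M)


def variance(m):
    """Formula (7), written as (1/M_0^2) * sum_i M_i (M_i - m_i) sigma_i^2 / m_i."""
    return sum(Mi * (Mi - mi) * si ** 2 / mi for Mi, si, mi in zip(M, sigma, m)) / M0 ** 2


def minimum_variance():
    """Formula (18)."""
    a = sum(Mi * si * math.sqrt(ci) for Mi, si, ci in zip(M, sigma, c))
    return (a ** 2 / C - sum(Mi * si ** 2 for Mi, si in zip(M, sigma))) / M0 ** 2


denom = sum(Mi * si * math.sqrt(ci) for Mi, si, ci in zip(M, sigma, c))
m_star = [C * Mi * si / (math.sqrt(ci) * denom) for Mi, si, ci in zip(M, sigma, c)]  # (17)
m_rounded = [round(mi) for mi in m_star]  # nearest integers

v_min = minimum_variance()
v_exact = variance(m_star)
v_rounded = variance(m_rounded)
print(m_rounded, v_min, v_exact, v_rounded)
```

With these figures the rounded allocation changes the variance by a small fraction of one per cent. The rounded system no longer costs exactly $C$, which is why its variance may fall slightly on either side of (18).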

Returning to formula (17), you will notice that, roughly speaking, in 
order to attain the greatest benefit from a given stratification of the popu- 
lation, you should sample more heavily the strata which are more variable 
and also the strata in which sampling is less expensive. 

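In order to see the effect of sampling proportionately to the sizes $M_i$ of
the strata, put $m_i = kM_i$ where $k$ stands for the factor of proportionality.
This factor is determined from the condition that the total cost of sampling
must be $C$,

$$\sum_{i=1}^{s} m_i c_i = k \sum_{i=1}^{s} M_i c_i = C, \tag{19}$$

so that

$$k = \frac{C}{\sum_{j=1}^{s} M_j c_j}. \tag{20}$$

Denoting by $\sigma^2_{\text{prop}}$ the variance of the estimate $X_{..}$ corresponding to the
proportional system of sampling and using (15) and (18), we have

$$\sigma^2_{\text{prop}} = \sigma_0^2 + \frac{\sum_{j=1}^{s} M_j c_j}{C M_0^2} \sum_{i=1}^{s} M_i c_i \left( \frac{\sigma_i}{\sqrt{c_i}} - B \right)^2, \tag{21}$$

where $B$ stands for the weighted mean of the quotients $\sigma_i / \sqrt{c_i}$,

$$B = \frac{1}{\sum_{j=1}^{s} M_j c_j} \sum_{i=1}^{s} M_i c_i \frac{\sigma_i}{\sqrt{c_i}} = \frac{AC}{\sum_{j=1}^{s} M_j c_j}. \tag{22}$$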


The importance of the disadvantage of proportional sampling when compared
to the optimum depends, then, on the variability of the quotients
$\sigma_i / \sqrt{c_i}$. If the population sampled contains the whole of a geographic
area one may expect that $\sigma_i$ will be larger in the urban districts than in the
rural. On the contrary, one may expect the values of the $c_i$, on account
of the cost of travel, to be smaller in the cities than in the country. Thus,
both factors considered are likely to contribute to the variation of the
quotients $\sigma_i / \sqrt{c_i}$. As a result, it seems probable that, in order to attain
the precision in the estimate which is best within the limits of funds available,
the rural areas should be sampled less heavily than the cities.
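
The loss incurred by proportional sampling can be illustrated numerically. The sketch below computes the variance under proportional sampling directly from (7), with $m_i = kM_i$, and checks that it exceeds the optimum by a weighted sum of squared deviations of the quotients $\sigma_i/\sqrt{c_i}$ from their weighted mean $B$. All figures are hypothetical; the first stratum is meant to suggest a variable, cheaply sampled urban area and the last a uniform, expensive rural one.

```python
import math

# Hypothetical strata: sizes, internal standard deviations, unit costs, budget.
M = [1000, 4000, 5000]
sigma = [12.0, 5.0, 2.0]
c = [1.0, 4.0, 9.0]
C = 2000.0
M0 = sum(M)


def variance(m):
    """Formula (7), written as (1/M_0^2) * sum_i M_i (M_i - m_i) sigma_i^2 / m_i."""
    return sum(Mi * (Mi - mi) * si ** 2 / mi for Mi, si, mi in zip(M, sigma, m)) / M0 ** 2


total_unit_cost = sum(Mi * ci for Mi, ci in zip(M, c))
k = C / total_unit_cost                      # factor of proportionality, (20)
var_prop = variance([k * Mi for Mi in M])    # proportional sampling

A = sum(Mi * si * math.sqrt(ci) for Mi, si, ci in zip(M, sigma, c)) / C          # (14)
var_opt = (C * A ** 2 - sum(Mi * si ** 2 for Mi, si in zip(M, sigma))) / M0 ** 2  # (18)

B = A * C / total_unit_cost                  # weighted mean of sigma_i / sqrt(c_i), (22)
excess = total_unit_cost / (C * M0 ** 2) * sum(
    Mi * ci * (si / math.sqrt(ci) - B) ** 2 for Mi, si, ci in zip(M, sigma, c)
)
print(var_prop, var_opt, excess)
```

In this invented case the proportional sample is more than twice as variable as the optimum one, because the quotients $\sigma_i/\sqrt{c_i}$ differ widely from stratum to stratum.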

Frequently one hears the assertion that, whatever way one stratifies a 
population, proportional sampling will give results which are always more 
precise than an unrestrictedly random sample of equal size. It is important 
to remember that this assertion is false. To show this, let us consider the 
simplest case where the cost of sampling per unit is exactly the same in 
all parts of the population. 

If we ignore the stratification of the population $\Pi$ and base the estimate of
the grand mean $U_{..}$ on an unrestrictedly random sample of $m_0$ units drawn
from $\Pi$, then the variance of the estimate, say $\sigma_u^2$, will be represented by just
the term similar to the general term in (7),

$$\sigma_u^2 = \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \sum_{i=1}^{s} \sum_{j=1}^{M_i} (u_{ij} - U_{..})^2. \tag{23}$$

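By adding and subtracting $u_{i.}$ within the parenthesis and then expanding
the square, expression (23) reduces to

$$\sigma_u^2 = \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \left[ \sum_{i=1}^{s} M_i (u_{i.} - U_{..})^2 + \sum_{i=1}^{s} (M_i - 1) \sigma_i^2 \right]. \tag{24}$$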
In the following it will be convenient to use the symbols $\bar{\sigma}$ and $(\sigma^2)$ to denote,
respectively, the weighted averages

$$\bar{\sigma} = \frac{1}{M_0} \sum_{i=1}^{s} M_i \sigma_i \tag{25}$$

and

$$(\sigma^2) = \frac{1}{M_0} \sum_{i=1}^{s} M_i \sigma_i^2. \tag{26}$$

Then (24) can be rewritten as

$$\sigma_u^2 = \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \left[ \sum_{i=1}^{s} M_i (u_{i.} - U_{..})^2 + M_0 (\sigma^2) - \sum_{i=1}^{s} \sigma_i^2 \right]. \tag{27}$$
If $c_1 = c_2 = \cdots = c_s$, the formula for $\sigma_0^2$ simplifies and becomes

$$\sigma_0^2 = \frac{1}{m_0} (\bar{\sigma})^2 - \frac{1}{M_0} (\sigma^2). \tag{28}$$
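Therefore,

$$\sigma_u^2 - \sigma_0^2 = \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \sum_{i=1}^{s} M_i (u_{i.} - U_{..})^2 + \frac{M_0^2 - m_0}{m_0 M_0 (M_0 - 1)} (\sigma^2) - \frac{1}{m_0} (\bar{\sigma})^2 - \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \sum_{i=1}^{s} \sigma_i^2. \tag{29}$$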


Using formula (9), we may write

$$(\sigma^2) = (\bar{\sigma})^2 + \frac{1}{M_0} \sum_{i=1}^{s} M_i (\sigma_i - \bar{\sigma})^2. \tag{30}$$


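On substituting this expression into (29) and rearranging, we obtain

$$\sigma_u^2 - \sigma_0^2 = \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} \left[ \sum_{i=1}^{s} M_i (u_{i.} - U_{..})^2 + (\bar{\sigma})^2 - \sum_{i=1}^{s} \sigma_i^2 \right] + \frac{M_0^2 - m_0}{m_0 M_0^2 (M_0 - 1)} \sum_{i=1}^{s} M_i (\sigma_i - \bar{\sigma})^2. \tag{31}$$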

This is an important formula. It indicates the methods of stratification
of the population for which an optimally stratified sample will yield the
best results. Also, this same formula indicates how even an optimum
stratified sample may fail. The latter circumstance will certainly occur
if the stratification of the population is so unlucky that the means of the
strata and also the internal variabilities of the strata are all equal. Then
we have

$$u_{1.} = u_{2.} = \cdots = u_{s.} = U_{..} \tag{32}$$

and

$$\sigma_1 = \sigma_2 = \cdots = \sigma_s = \bar{\sigma}. \tag{33}$$

In this case,

$$\sigma_u^2 - \sigma_0^2 = -(s - 1) \frac{M_0 - m_0}{m_0 M_0 (M_0 - 1)} (\bar{\sigma})^2 \tag{34}$$

and it appears that even the optimally stratified sample will give less
precise results than the unrestrictedly random sample. It is true that, in
most cases, the value of (34) will be negligible. Also, it is most improbable
that, with reasonable effort at good stratification, the equalities (32) and
(33) will be satisfied, even approximately.
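
The unlucky case can be verified on a toy population. In the sketch below, which uses exact rational arithmetic, four strata contain identical sets of values, so that (32) and (33) hold exactly. The population, the sample size $m_0 = 8$ and the helper `within_variance` are all invented for the illustration; the unit cost is taken equal in all strata.

```python
from fractions import Fraction


def within_variance(values):
    """Internal variability sigma_i^2 of a stratum, with divisor M_i - 1 as in (8)."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Hypothetical population: four strata with identical contents.
strata = [[0, 2, 4, 6, 8] for _ in range(4)]
s = len(strata)
M = [len(stratum) for stratum in strata]
M0 = sum(M)
m0 = 8  # total sample size

# Unrestrictedly random sample of m0 units, formula (23).
everything = [v for stratum in strata for v in stratum]
grand_mean = Fraction(sum(everything), M0)
var_unrestricted = Fraction(M0 - m0, m0 * M0 * (M0 - 1)) * sum(
    (v - grand_mean) ** 2 for v in everything
)

# Optimum stratified sample: with equal M_i sigma_i and equal costs, 2 units per stratum.
m = [m0 // s] * s
var_stratified = sum(
    Mi * (Mi - mi) * within_variance(stratum) / mi for Mi, mi, stratum in zip(M, m, strata)
) / M0 ** 2

# Right hand side of (34).
sigma_bar_sq = within_variance(strata[0])
predicted = -(s - 1) * Fraction(M0 - m0, m0 * M0 * (M0 - 1)) * sigma_bar_sq
print(var_unrestricted, var_stratified, predicted)
```

Here the stratified sample is indeed the less precise of the two, by exactly the amount that (34) predicts.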

Formula (31) indicates that for stratification to be successful when compared
with unrestrictedly random sampling, it is necessary to make the
particular strata as different as possible (i) with respect to the averages $u_{i.}$
and (ii) with respect to their internal variability as measured by the $\sigma_i^2$.

You will notice that, while these results are interesting theoretically,
there is considerable difficulty in applying them in practice. The optimum
stratification of a sample depends (1) on the sizes of the strata as represented
by the numbers $M_i$ of sampling units in the $i$th stratum; (2) on the
internal variability $\sigma_i^2$ of the $i$th stratum and (3) on the average cost $c_i$
per unit of sampling in the $i$th stratum. Since the choice of the units of
sampling is at our disposal, the numbers $M_i$ are likely to be known exactly.
This is not so for the values of $\sigma_i$ and, probably, not so for the values of $c_i$.

If we knew the numbers $\sigma_i^2$, we would probably know the numbers $u_{i.}$
and then there would be no need of sampling. It follows that in no practical 
case is there an exact and immediate application of the formulas I have 
given. However, in this respect our situation is no worse than it is in any 
other attempt to apply mathematical results in practice. In every case, 
the theory applies only approximately to the situation studied and the data 
substituted into the mathematical formulas are not exact values of the 
variables concerned, but only approximations. Thus, if the exact values 
of $\sigma_i$ are not available, there are ways and means to estimate them approximately.
One typical situation occurs when a particular kind of survey is
repeated year after year. In this case last year’s sample may be used to
estimate the values of $\sigma_i$ for the next year’s survey. The theory behind
this procedure is that, while the average level of a given characteristic 
changes considerably from one year to another, the internal variability of 
the particular strata is much less unstable and, in particular, a stratum 
that appears more variable than the others during one year is also more 
variable the next. 

Undoubtedly, there are cases where no information is available about 
the internal variability of the strata. Then, the best you can do is to use 
a part of the funds available to conduct a preliminary inquiry which, 
incidentally, will be helpful for training enumerators. The size of this 
preliminary inquiry may be very moderate. Out of each of the strata a 
small number of units of sampling, say 20, are selected at random and the 
values of the relevant characteristic U are established. Then these values 
are used to estimate the within strata variances $\sigma_i^2$. Finally, the estimates
of the variances $\sigma_i^2$ are used to determine the optimum stratification of the
sample to be drawn for the main part of the survey.
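
The procedure just described may be sketched as follows, assuming for simplicity equal unit costs, so that (17) reduces to $m_i^* = m_0 M_i \sigma_i / \sum_j M_j \sigma_j$. The three strata are simulated with Python's `random` module; their sizes, means and standard deviations, the seed, and the main sample size of 300 are hypothetical.

```python
import random
import statistics

rng = random.Random(1938)  # fixed seed so that the sketch is reproducible

# Hypothetical strata: (size M_i, true mean, true standard deviation).
design = [(2000, 50.0, 1.0), (500, 80.0, 10.0), (1000, 20.0, 4.0)]
strata = [[rng.gauss(mu, sd) for _ in range(M_i)] for M_i, mu, sd in design]

# Preliminary inquiry: 20 units drawn at random from each stratum.
pilot_size = 20
sigma_hat = [statistics.stdev(rng.sample(stratum, pilot_size)) for stratum in strata]

# Equal-cost form of (17), with the estimates in place of the unknown sigma_i.
m0 = 300
weights = [len(stratum) * s_hat for stratum, s_hat in zip(strata, sigma_hat)]
allocation = [m0 * w / sum(weights) for w in weights]
sampling_fraction = [m_i / len(stratum) for m_i, stratum in zip(allocation, strata)]
print(sigma_hat, allocation, sampling_fraction)
```

The sampling fractions come out proportional to the estimated $\sigma_i$, so that even a rough preliminary estimate is enough to direct the bulk of the sample toward the more variable strata.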

The situation with the cost $c_i$ per unit of sampling is similar. The exact
values of the numbers $c_i$ must remain unknown until the books are closed
on the survey, but more or less accurate estimates are not difficult to obtain,
especially if the same population has been sampled repeatedly. A closer
analysis of previous surveys will probably show that within a given stratum
the total cost of sampling increases somewhat more slowly than in direct
proportion to the number of units selected for the sample. If this is so,
the institution concerned will do well to establish for each of the strata a
schedule of the following kind: if the number of units sampled is between
20 and 30 say, then the cost per unit will be approximately so much; if the
number of units sampled is between 30 and 40, then the cost per unit will
be something else; etc. Figures of this kind could then be used to produce
tentative values at first and improved values later of the $m_i^*$ according to
formula (17).

It is useful to plan the work so that, by the end of the survey, not only 
the preliminary sampling but also all the data collected could be used to 
obtain better estimates of the $c_i$. Additional computation would then show
whether or not the stratification of the sample which was actually made 
was far from optimum, how much accuracy was lost and whether or not it 
was worthwhile to try to improve on proportional sampling. Naturally, if 
data are available, such computations should be made before determining 
the scheme of sampling. 

Question by Mr. Stock: If you were measuring a number of characteristics,
to which one would you tie the $m_i$?

Answer: I welcome this question. It is true that a sizeable inquiry is 
never planned in order to determine a single mean. On the contrary, we 
are always interested in a number of characteristics of the population 
studied and we must make a choice between them. In some cases there 
may be a characteristic of the population which is overwhelmingly more 
important than the others and then the choice is easy. In other cases 
there is a group of several characteristics about equally important, and 
then the situation is more complicated and may be satisfactorily resolved 
only after some study. Let me illustrate this point on a particular survey 
conducted by the Institute for Social Problems in which I took part and 
which brought me into contact with problems of sampling human popu- 
lations. 

The survey was undertaken in connection with a reform of the Polish 
system of social insurance and was meant to provide a basis for determining
the contributions payable by workers and by employers. For this
purpose it was necessary to estimate the total number of workers subject
to insurance, their age distribution, family status, etc. All this information
was available in the data of the 1931 general census of Poland. Unfortu- 
nately, however, a complete tabulation of the census was not expected for 
some years to come and the actuarial computations had to be based on a 
sample taken from the original census records. Thus we were faced with 
the problem of sampling, not the population of living persons, but the census 
data. 

The whole of Poland was divided into 26 strata. The subdivision was 
made taking into account both the convenience of sampling and the general 
principle that the strata internally should be as uniform as possible. The 
way in which the data were stored enforced the adoption of the enumeration 
district as the unit of sampling. 

Although we intended to study many characteristics of the population, 
we agreed to consider the following six as the most important: 


x = total sum at risk connected with the sickness insurance of employed
    males, aged 20-64.
y = total number of employed males.
z = total number of employed males, aged 20-64.
u = total number of employed males, aged above 64.
v = total number of employed females.
w = total number of insurable population.


Since no precise information was available beforehand about the internal
variability of the individual strata, it was necessary to resort to a preliminary
inquiry. Table I, compiled from the data in my Polish publication
quoted, gives each of the strata, the numbers $M_i$ of elements of sampling
and the estimates of the numbers $\sigma_i$ computed for each of the six
characteristics x, y, z, u, v, w.

When following the columns of the estimated standard deviations within 
the individual strata, you will hardly fail to notice that the standard 
deviations of the six characteristics are positively correlated. Consider, for 
example, the last four strata. Although no strict regularity exists, it is 
obvious that frequently a stratum greatly variable with respect to one 
characteristic is also variable with respect to the others. This empirical 
fact has a theoretical explanation and is connected with the circumstance 
that the characteristics of the particular units of sampling are usually cor- 
related. This correlation may be positive or negative, but the resulting 
correlation between the corresponding $\sigma_i$ is always positive. The correlation
coefficients between the estimates of $\sigma_i$ for w on the one hand and
for x, y, z, u, v on the other, are given in Table II.

Thus if we stratify the sample so that optimum conditions for one of the
characteristics are approached, then as a result of this correlation, we are
likely to do reasonably well for the others. In this particular inquiry of
the Institute for Social Problems, it was considered that the total number
of insurable workers was the most important characteristic and the stratification
of the sample was adjusted in accordance with the variability of w.


TABLE I

Sizes of strata and their internal variability with respect to six important characteristics

Stratum   Size of      Estimates of internal variability of strata
No.       stratum      in terms of the $\sigma_i$
$i$       $M_i$           x       y       z       u       v       w

  1        2,041        .83      16      14       …       …      52
  2        1,185        .95      20      17     2.8     8.2      50
  3        3,032       1.66      25      23       …     9.6      81
  4          371        .97       …      10     1.6     8.2      37
  5          249       1.03      14      12     1.6     9.4      36
  6          681        .84      19      16       …    13.9      68
  7        3,432        .48      16      15       …     3.2      54
  8          801        .79      17      16       …     5.4      45
  9        2,196       1.16       …      18     2.8    12.9      66
 10        4,079       1.62      23      21     1.0     9.8      63
 11        2,952        .49       9       7     3.0     9.3      20
 12        1,123        .83      10       9       …     3.5      16
 13        1,516          …      15      12       …      22      28
 14        1,990       1.04      21      18     1.6    20.5      88
 15          998       1.88       …      19     2.1     7.6       …
 16          762        .62      10       8       …    10.1      21
 17        2,867          …      11      10       …     9.3      34
 18          443          …       …      10     1.9    16.1      26
 19        2,385          …       …       …       …     8.0      34
 20        4,326        .51      12      10       …     8.7      27
 21        2,985        .36       7       7     1.4     8.8      38
 22        1,243        .65      12       …       …     9.3      76
 23       29,885        .48       8       …       …       …      28
 24       18,636        .16       8       6      .9     2.2      21
 25       10,906       1.29      35      31     1.0     9.2      86
 26       22,299        .27       6       4      .3     6.5      20

Total    123,383          …       8       7       …       …       …


TABLE II

Coefficients of correlation between the estimates of $\sigma_i$ for w and those for x, y, z, u, v

Characteristic                x       y       z       u       v

Correlation coefficient    .512    .807    .799       …    .363

Table I illustrates other interesting details that occurred in the Polish 
inquiry and may occur in others. It will be seen from the column of M; 
that the sizes of the individual strata varied between very broad limits. 
In fact, the smallest stratum contains only 249 units of sampling, while 
the largest contains almost 30 thousand! As I have already mentioned, 
great care was taken in establishing the strata to use existing information 
(unfortunately, this information was predominantly qualitative in character, 
without actual figures) in order to obtain strata internally as uniform as 
possible. Thus, whenever a large block of the country was left undivided 
as a single stratum, it was because of the prevailing belief that the block 
was very uniform. In a number of cases this guess proved successful. 
For example, the three very large strata, Nos. 23, 24 and 26, are rather 
uniform with respect to all six characteristics studied. However, whenever 
guesswork and intuition underlie our actions, surprises are unavoidable 
and stratum No. 25 provided something like a shock. When the preliminary 
sample of 15 units indicated such great variability in No. 25, the committee 
in charge of sampling was inclined to ascribe this occurrence to a random 
sampling error and to disbelieve the figures obtained. Accordingly, the 
preliminary sample from stratum 25 was raised to 34 units. Naturally, the 
new estimate of $\sigma_{25}$ differed from the first, but the conclusion as to the
internal variability of this stratum remained unchanged. 

In this particular case no harm was done by enlarging the preliminary 
sample because, in order to complete the sampling, we had to select from 
the same stratum an additional 300 units. However, the reverse situation 
with strata 4, 5, and 8 did cause a certain loss in the precision of the final 
sample. With respect to these strata, small in size, it was believed that, 
owing to their industrialized character, they would be internally rather 
heterogeneous. When the preliminary samples contradicted this expecta- 
tion, the samples were markedly increased with no essential change in the 
final conclusion. As a result, the estimated values of the $m_i^*$ for the three
strata were 5, 3 and 18, respectively. However, in the preliminary sampling 
we had already selected 33 sampling units from stratum 4, 21 units from 
stratum 5 and 61 units from stratum 8. Thus the preliminary inquiry 
oversampled a number of strata and, in consequence, since the funds for 
sampling were strictly limited, undersampling of the other strata was un- 
avoidable. As a result, after the preliminary inquiry had been completed 
and a substantial part of the funds had been spent, we were faced with 
(seemingly) the new problem of how to apportion the balance of the money
among the undersampled strata so as to attain the greatest accuracy of
results. 

This problem is only seemingly new and is immediately reduced to the
use of the same formula (17). It is obvious that no more sampling was
needed from strata which were already oversampled. Thus, the problem 
of the best use of the money reduced itself to minimizing the total of those 
terms in formula (7) which referred to the undersampled strata. Naturally, 
this had to be done with the use of a new value of mo, equal to the initial 
value minus the number of units of sampling already selected from the 
oversampled strata. 
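
The reduction may be sketched as follows, assuming equal unit costs. The function `reallocate` is hypothetical. It keeps the strata that were already oversampled at their preliminary numbers, subtracts those numbers from $m_0$, and applies the equal-cost form of (17) to the remaining strata. The repetition of this step until no further stratum turns out to be oversampled is my own assumption, added because lowering $m_0$ may push another stratum over its new optimum. The sizes and preliminary sample counts of the first three strata echo strata 4, 5 and 8 of the text; the other two strata, the $\sigma_i$ as used here and the total $m_0 = 400$ are invented.

```python
def reallocate(M, sigma, already_drawn, m0):
    """Equal-cost form of (17), applied only to strata not yet oversampled."""
    fixed = set()  # strata kept at their preliminary sample size
    while True:
        free = [i for i in range(len(M)) if i not in fixed]
        remaining = m0 - sum(already_drawn[i] for i in fixed)  # new value of m0
        denom = sum(M[i] * sigma[i] for i in free)
        target = {i: remaining * M[i] * sigma[i] / denom for i in free}
        newly_fixed = [i for i in free if already_drawn[i] > target[i]]
        if not newly_fixed:
            break
        fixed.update(newly_fixed)
    return [already_drawn[i] if i in fixed else target[i] for i in range(len(M))]


# Three small strata oversampled in the preliminary inquiry, two large ones not.
M = [371, 249, 801, 3000, 5000]
sigma = [37.0, 36.0, 45.0, 80.0, 60.0]
already_drawn = [33, 21, 61, 15, 15]
m0 = 400  # hypothetical total number of units the funds allow

final = reallocate(M, sigma, already_drawn, m0)
print(final)
```

The sketch assumes that $m_0$ is at least as large as the number of units already drawn; otherwise there is no balance left to apportion.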

Question by Dr. Sidney Wilcox: If you had been advising the Italian
census people, what specific advice would you have given? 

Answer: I would have advised them to consider their circondari not as 
units of sampling but as strata. These strata should have been subdivided 
into units of sampling as small as the character of the material permitted:
parishes, streets, single houses, whatever was possible. As a matter of
fact, I remember seeing a footnote in Gini and Galvani’s paper in which 
they themselves suggest that probably their results would have been more 
satisfactory if, instead of sampling circondari, they had sampled parishes. 
In this, of course, they are perfectly correct. 

There is a special difficulty in carrying out an inquiry based on a random 
sample, which seems to be worth mentioning. This is psychological in 
nature. Generally we do not rely on random sampling. Intuitively, we 
are inclined to think that it is not wise to rely on chance if there is any 
knowledge available to guide our steps. I have seen many instances 
where a feeling similar to this has made it difficult to reach a decision on 
how an inquiry should be carried out. I remember very well the doubts 
that I myself had. “That’s all right in theory,” I thought, “but how would 
this random sampling work in practice?” Then a great discovery satisfied 
me how to make up my mind; and since that discovery has worked well 
with other people, I shall mention it to you. It consists in a simple rule: 
try and see. As far as our intuitive feeling against some theoretical result 
is concerned, there is nothing like an experiment. In the case of a planned 
inquiry by sampling, and the question of how to sample, I would take 
some 1000 sheets from the data, consider them as a sampled population 
and perform on them in detail all the steps of the several alternative 
methods of sampling that are contemplated. But I must add two warnings. 

(a) The population in this experimental sampling, like the populations 
we study in practice, must be sufficiently heterogeneous. 

(b) The size of the random sample you draw in experimental study must 
contain a sufficient number of units, say 80 or 100.

I am certain that a few trials of this sort will appeal to your intuition
and will give you a comfortable feeling of safety in random sampling, in 
spite of the fact that in sampling randomly you sometimes ignore knowledge 
of certain details. But you must remember that in following the indications 
of the theory you make use of some other kind of knowledge, that of mathe- 
matical statistics. 
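
An experiment of the "try and see" kind is easily imitated on a computer. The sketch below builds a hypothetical heterogeneous population of 1000 units in four strata, repeatedly draws samples of 100 units by two alternative methods (unrestricted random sampling and proportional stratified sampling), and compares the empirical variances of the resulting estimates. The population, the seed and the number of trials are invented for the illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Hypothetical "sheets": four strata of 250 units whose means differ strongly.
strata = [[rng.gauss(mu, 3.0) for _ in range(250)] for mu in (10.0, 20.0, 30.0, 40.0)]
population = [v for stratum in strata for v in stratum]
sample_size = 100
trials = 500


def unrestricted_estimate():
    """Mean of an unrestrictedly random sample from the whole population."""
    return statistics.fmean(rng.sample(population, sample_size))


def proportional_estimate():
    """Mean of a proportional stratified sample (equal strata, hence equal shares)."""
    per_stratum = sample_size // len(strata)
    return statistics.fmean(
        v for stratum in strata for v in rng.sample(stratum, per_stratum)
    )


var_unrestricted = statistics.variance([unrestricted_estimate() for _ in range(trials)])
var_proportional = statistics.variance([proportional_estimate() for _ in range(trials)])
print(var_unrestricted, var_proportional)
```

Because the strata here are of equal size and equally sampled, the plain mean of the stratified sample coincides with the weighted estimate of the grand mean. With strata so different in their means, the stratified estimates scatter far less than the unrestricted ones.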

Question by Dr. Sidney Wilcox: I would like to ask a question that
is somewhat related to this matter of drawing the sample. It is fairly 
common practice to take a list of the elements of sampling and to start 
with one that is selected by some device or other and then to take every 
tenth or twentieth on down the list in order to form the sample. This 
plan is often used instead of setting up a system of random numbers or 
drawing numbers at random and then selecting the sample according to 
the model or game of chance. Are there any advantages or disadvantages 
that one should bear in mind when making use of the device of taking 
every tenth name on the list, every tenth family, house or district? 

Answer: I think there is a definite advantage in using a mechanical 
process of random sampling throughout; that is to say, not taking every 
tenth unit as listed. Sometimes nothing will be improved and then your 
tenth or twentieth house will be as good. But there is the possibility, espe- 
cially in new and properly planned towns, that if you take every twentieth 
or fifteenth house, you will be synchronized with something very essential 
in the town itself. I know of one small inquiry where they took a sample 
of houses in a few villages. As the houses were numbered, they decided 
to take every fifth or every tenth, and hoped to obtain a very good sample. 
But what they obtained was something very surprising. After going back 
to the sampled villages, they found that house No. 1 was always the one 
belonging to the squire and this disturbed the sample. In new towns it is 
likely that every block will have the same number of houses. Therefore, 
if you take every fifth house, you may either omit corners or systematically 
include all of them, and thus you may introduce a considerable bias in the 
sample. 

It is essential to be clear about the exact nature of the procedure suggested.
The process is this. We take the first ten units of sampling listed
and select one of them at random. Let $x$ be its order number. Then to
form the sample we take the units numbered $x$, $x + 10$, $x + 20$, etc.
It will be seen that this procedure is equivalent to dividing the population
to be sampled into 10 parts,

     1st part, sampling units No.  1, 11, 21, 31, ...
     2nd part, sampling units No.  2, 12, 22, 32, ...
     3rd part, sampling units No.  3, 13, 23, 33, ...
     ...
    10th part, sampling units No. 10, 20, 30, 40, ...

Then we treat these parts as units of sampling and take only one of them
to form a sample.

Obviously, if we proceed in this way we do not rely on the theory of 
probability but on good luck, with the hope that the ten parts into which 
the whole population is divided are very similar. This will frequently be 
the case, but there are obvious dangers. I recommend that one rely on 
chance as governed by the empirical law of big numbers, but I do not 
recommend that one rely on good luck. 
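
The danger is easily exhibited on an artificial example. The sketch below lists a hypothetical street of 200 houses in which every house whose number ends in 1 (a corner house, say) differs sharply from the rest, forms the ten possible "every tenth" samples and computes the estimate each would give. The street and its values are invented.

```python
# Hypothetical street: houses No. 1, 11, 21, ... have the value 5, all others 1.
houses = {number: (5.0 if number % 10 == 1 else 1.0) for number in range(1, 201)}
true_mean = sum(houses.values()) / len(houses)


def systematic_sample(start, step=10):
    """Units numbered start, start + 10, start + 20, ... up to the end of the list."""
    return [houses[n] for n in range(start, len(houses) + 1, step)]


# One estimate for each of the ten parts into which the population is divided.
estimates = {}
for start in range(1, 11):
    part = systematic_sample(start)
    estimates[start] = sum(part) / len(part)

print(true_mean, estimates)
```

One of the ten parts consists of corner houses only and the other nine contain none, so that whichever part chance selects, the estimate misses the true mean badly. Each sample has 20 units, but in effect it is a sample of one unit out of ten.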

As a matter of fact, there are no special difficulties in sampling randomly. 
There is a very useful little book of Tippett’s Random Sampling Numbers⁸
which may be recommended for the purpose. If your sampling units are 
listed and numbered in order, to take a random sample of them, you simply 
open the book and read in turn a sufficient amount of numbers. Whenever 
the same number appears twice, you simply ignore it. Also, you ignore 
all numbers exceeding the total of your sampling units. 
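
The rule just stated may be sketched in a few lines. Here Python's `random` module stands in for the printed table, which is an assumption of the sketch and not the procedure of the text; the function `draw_units`, the four-digit reading and the figures in the usage line are likewise hypothetical.

```python
import random


def draw_units(n_units, sample_size, digits=4, rng=None):
    """Read numbers of `digits` figures in turn, skipping repeats and numbers above n_units."""
    if sample_size > n_units or n_units >= 10 ** digits:
        raise ValueError("need sample_size <= n_units < 10**digits")
    rng = rng or random.Random()
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randrange(10 ** digits)  # stands in for the next table entry
        if number < 1 or number > n_units:    # no unit bears this number: ignore it
            continue
        if number in chosen:                  # already selected: ignore it
            continue
        chosen.append(number)
    return chosen


# Hypothetical use: 15 units out of a stratum of 2,041 numbered units.
sample = draw_units(n_units=2041, sample_size=15, rng=random.Random(7))
print(sample)
```

The check at the top guards against a request that could never be satisfied, such as asking for more distinct units than the list contains.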

Question by Dr. Lang: I do not see how this system can be applied to
names that are listed alphabetically. 

Answer: Before using Tippett’s Random Sampling Numbers you will 
have to number all your names. 

In regard to the question just discussed, it may be useful to mention
that in many cases every tenth item will give as good a sample as the
application of Tippett’s numbers. Other methods may be used also. It is
very difficult to give a general rule for distinguishing between reasonable
precautions to insure randomness and attempts to “split hairs.” Here the
research worker must acquire experience and use his own judgment. It
must be emphasized, however, that the use of random numbers does not
present any difficulty, and that their use puts you on the safe side.

8 L. H. C. Tippett: “Random Sampling Numbers,” Tracts for Computers, No. XV,
Cambridge University Press, 1927, viii + 26 pp.

See also two newer tables of random numbers: R. A. Fisher and Frank Yates:
Statistical Tables for Biological, Agricultural and Medical Research, Oliver and Boyd,
London, 1938, 90 pp. M. G. Kendall and B. Babington Smith: “Tables of Random
Sampling Numbers,” Tracts for Computers, No. XXIV, Cambridge University Press, 1939,
x + 60 pp.

In recent times, with the advent of high speed calculators capable of producing
rapidly great quantities of “random numbers,” and with the increased use of punch card
machines, tables of random numbers have given way to sets of random numbers
punched on cards. A set of a million or so random numbers on cards should be
considered a regular part of the equipment of every modern institution engaged in
sampling surveys.

The words “random numbers” are placed in quotation marks. It is hoped that the
reader of this book will realize that, strictly speaking, no such thing as a “random
number” or a “set of random numbers” can actually exist. What can, and actually
does exist is a method of producing numbers imitating successfully the concept of
independent sampling from a uniformly distributed population.

Question by Mr. Kantor: Suppose that you have to sample the workers
in various industries in several states or other geographical areas. You 
do not have any record of the unemployed, and you want a sample that 
will give you the percentage of unemployed in each industry for each of 
the areas. The reason for the different areas is that there may be economic 
factors that affect the unemployment rate in an area where there is a small 
part of the industry as contrasted with the area where there is a major 
center of it or where there is diversified or unified industry. How can one 
go about getting a sample that would give results equally accurate for each 
industry within each district? 

Answer: There is no particular difficulty in approaching the ideal of 
equally accurate estimates for different areas concerning the same industry 
but it may be impossible to attain in addition to this a similar equality in 
accuracy for all industries. The situation you describe is more complicated 
than the ones we have considered. The different areas you mention, let 
us call them partial populations, must be considered separately. 

In particular, each partial population must be stratified. If the internal 
variability of each stratum is known or can be estimated, the application 
of formula (17) will determine the optimum stratification of the sample 
to be taken from each partial population. Then the optimum variance of 
the sample mean will become a function of the total number of elements 
which will be selected from any given partial population. The problem of 
allocating the available funds to particular partial populations so as to 
insure the same precision for each will reduce to something very similar to 
that described above and I am sure will present no new difficulties. 

Question by Mr. Kantor: In attempting to get an estimate of the
variability that you are going to use in deciding what proportion to draw, 
you will have to take a test count in each of your areas; you have the 
count scattered over a number of characteristics; it is no longer one charac- 
teristic that you measure. You would have to get a test drawing for a 
number of industries in each of your areas and then compute actual 
unemployment rates. Isn’t that the only way in which you can proceed 
with many industries? It seems to me that you have to take a full count. 

Answer: I do not think so. The preliminary inquiry designed to esti- 
mate the variability of the strata may be very small in size. Dr. Sukhatme 
investigated this question⁹ and found that 20-30 units of sampling out of
each stratum would be plenty. Indeed he suggests as few as 15. Also it is 
not necessary to make a separate preliminary inquiry for each industry. 
You make one such inquiry for an area and use it to estimate $\sigma_i(t)$ for
each of the industries in turn. Then, substitute your estimate of the true
$\sigma_i(2)$ in formula (17), separately for each industry. You will see that this
formula will give more or less similar results for all industries. Alterna- 
tively, you may adjust the proportions of sampling to some single character 
treated as basic. I would choose for this the total number of workers 
within the sampling unit since it is likely to be highly correlated with the 
numbers of unemployed. 

Question by Mr. Kantor: In industry, we find that there are very great
differences in the proportion of unemployed, depending on the production 
rate of the industry to which the workers are attached. During a depres- 
sion, the production of goods for use in further production declines very 
rapidly, but the production of articles made for general consumption 
declines only slightly; an area devoted principally to the former type of 
production will have very high unemployment and an area devoted largely 
to the latter type of production will have small unemployment. Is this 
the variability that we can test by drawing a small preliminary sample? 

Answer: The variability of which you speak does not cause any trouble 
since this is a variability between strata or perhaps between partial popu- 
lations. I presume that the distribution of industries over the country is 
more or less known and that, when stratifying, you will be able to distin- 
guish areas differing in the general character of the prevailing industries. 
If you look closely into my formulas, you will notice that they depend 
upon the variability within the partial population and, more particularly, 
within the strata. Denote by w the number of workers within a unit of 
sampling, and by x the number unemployed. If you take one particular 
stratum and study the units of sampling, you may find a picture something 
like this: 





    Values of w:   100   150           35   10   200
    Values of x:    10   [illegible]    1    2    25

and you will have no difficulty in noticing that x and w are correlated.
Because of this correlation, the stratification of the sample which is
optimum for w will be reasonably good for x. Something of this sort
actually happened in an inquiry in Poland.


⁹ P. V. Sukhatme: “Contribution to the theory of the representative method.” Jr.
Roy. Soc. Stat. Supplement, Vol. 2 (1935), pp. 253-268. 
The plan to take one basic character as a unit has the advantage that
the preliminary inquiry may be very inexpensive and yet satisfactory. The 
enumerators could be asked to establish only the number of workers inhabit- 
ing the units of sampling, a task which takes but very little time and 
effort. But also this procedure has the definite disadvantage that, if you 
work with the basic character alone, the data collected during the prelimi- 
nary inquiry cannot be included in the main one. Therefore, probably 
I would carry out the preliminary exactly as the main one is to be made, 
the only difference being one of size. I would estimate $\sigma_i$ separately for
each industry and substitute it into formula (17). Then I would see what 
happens and what would be the accuracies of the average that I would 
obtain by this or that system of stratifying the sample. 

Question by Mr. Milton Friedman: In many cases the set of characteristics
that it is desired to study includes some about which information
can be obtained with relative ease and others about which information can 
be secured only through long and expensive interviews. In such cases it 
may be advisable to secure information on the first set of characteristics 
from a large random sample. This information may then be used to select 
a smaller stratified sample from which the second type of data can be 
secured. From the random sample would also be obtained weights to be 
used in combining the data from the various strata of the stratified sample. 

Thus, in the Study of Consumer Purchases, which is now being conducted 
under the auspices of the National Resources Committee, the Bureau of 
Labor Statistics, and the Bureau of Home Economics, the primary aim 
is to secure information on family expenditures. The sample from which 
such data are secured is, however, stratified with respect to income (as 
well as other characteristics). At the same time, there are no data on the 
relative frequencies of the different income classes. As a consequence, it 
was necessary to obtain information on income from a random sample of 
families in order to secure the weights for combining the data from the 
stratified sample. In view of the extremely high costs involved in securing 
the data on expenditures, and of the relatively low costs of securing the 
data on incomes, it was decided to make the random sample from which 
income information was obtained very much larger than the stratified 
sample giving information on expenditures. 

The question I should like to ask is whether or not any work has been 
done that would indicate the optimum relative size of the two samples on
the assumption that the relative costs and the relevant standard deviations 
are known. 

Answer: As far as I know, nothing has been done on the specific question 
you raise. I take it, however, that in such a case it would be necessary 
to conduct two preliminary inquiries, one designed to determine the relative 
frequency of the different income classes, and the other to determine the
standard deviations for the item in which you are particularly interested, 
for the different strata. The second preliminary investigation, as I have 
already indicated, would need to cover only a relatively small number of 
cases. 

Question (Mr. Friedman’s question restated by Dr. Sidney Wilcox):
For at least part of the work one step was taken in trying to get a random 
sample using every nth card, starting not with the first card but with a 
card which itself would be the result of accident. This was the process of 
finding out for a given city what proportion of the people are wage earners 
and clerical workers and what proportion are at one or another income level. 
This was an inexpensive survey. Then a long laborious process had to be 
followed in finding out in detail how they spent their money. The number 
of families responding to the more elaborate questionnaire might have no 
very close relationship to the number of families in the particular type of 
occupational activity or income level. And so the question of weights 
comes up. What should be the relative number that should be secured on 
the random basis? Should we take every tenth family or, knowing in 
advance approximately the costs of the operations and therefore how many 
schedules we are going to be able to get on the expenditure basis, how 
heavy a sample should we have taken on the random basis? What is the 
relative size of the random sample? Of the larger sample to the smaller? 

Answer: I repeat, as far as I am aware, the question asked has not been
considered; but it is so interesting that I shall be glad to see whether or 
not it can be answered by some simple method. If I succeed, I will certainly 
try to publish the results. 


Part 2. Theory of Friedman-Wilcox Method of Sampling 


(This section is a textual reproduction of the article, “Contribution to the Theory of 
Sampling Human Populations,” by the present author, originally published in the 
Journal of the American Statistical Association, Vol. 33 (1938), pp. 101-116. The author 
is deeply indebted to the Editors of the Journal for their kind permission to reproduce 
the paper.) 


1. INTRODUCTION 


At a Conference on Sampling Human Populations held last April at the 
Department of Agriculture Graduate School in Washington, a problem was 
presented by Mr. Milton Friedman and Dr. Sidney Wilcox for which I could 
not offer a solution at the time. Since it seemed to be important and of gen- 
eral interest, I have considered it in some detail. The purpose of this paper 
is to present the results I have obtained. 
2. STATEMENT OF THE PROBLEM


I shall start by describing the problem in much the same form as it was 
stated to me, without using any mathematical symbols. Then I shall for- 
mulate it in mathematical terms. The reader who does not wish to follow 
the mathematical processes may skip from equation (8) to the results and 
examples beginning with equation (52) on page 138. 

A field survey is to be undertaken to determine the average value of some 
character of a population, for example, the amount of money which families 
spend for food in a population of families residing in a certain district. The 
collection of these data requires long interviews by specially trained enumer- 
ators and, hence, the cost per family is quite high. Since the total cost of the 
survey must be held within the amount appropriated for it, the data must be 
secured from a small sample of the population. In view of the great varia- 
bility of the character, the sample appears to be too small to yield an estimate 
of the desired degree of accuracy. 

Now the character is correlated with a second character which can be deter- 
mined much more readily and at a low cost per family. Since a very accurate 
estimate of the second character can be secured at relatively small expense, 
and since for any given value of it, the variation of the original character will 
be smaller than it is in the whole population, a more accurate estimate of the 
original character may be obtained for the same total expenditure by arrang- 
ing the sampling of the population in two steps. The first step is to secure 
data, for the second character only, from a relatively large random sample of 
the population in order to obtain an accurate estimate of the distribution of 
this character. The second step is to divide this sample, as in stratified 
sampling, into classes or strata according to the value of the second character 
and to draw at random from each of the strata a small sample for the costly 
intensive interviewing necessary to secure data regarding the first character. 

An estimate of the first character based on these samples may be more 
accurate than one based on an equally expensive sample drawn at random 
without stratification. The question is to determine for a given expenditure, 
the sizes of the initial sample and the subsequent samples which yield the most 
accurate estimate of the first character. 

Let us now enter into the details and introduce the necessary notation.
Denote by $\pi$ the population studied and by $X$ the character of its individuals
the average of which, say $\bar{X}$, is to be estimated. This is the character the
collection of data on which is costly. Next let $Y$ denote the second character,
on which the collection of data is cheap, and which is assumed to be correlated
with $X$. The range of variation of $Y$ in $\pi$ being more or less known, we
shall divide it into $s$ intervals, say

$$\text{from } Y_0 \text{ to } Y_1, \text{ from } Y_1 \text{ to } Y_2, \dots, \text{ and from } Y_{s-1} \text{ to } Y_s. \tag{1}$$
Denote by $\pi_i$ the part of the population $\pi$ composed of the individuals for
which

$$Y_{i-1} \le Y < Y_i \qquad (i = 1, 2, \dots, s); \tag{2}$$

$\pi_i$ will be called the $i$th stratum of the population $\pi$. Denote further by

$$p_1, p_2, \dots, p_s \tag{3}$$

the proportions of the individuals of $\pi$ belonging to the strata $\pi_1, \pi_2, \dots, \pi_s$
respectively.

In the following we shall have to consider three different processes of sam- 
pling which it is important to distinguish. The first two form the method 
described by Mr. Friedman and Dr. Wilcox, which I shall further describe as 
the method of double sampling. The third will serve as a standard of com- 
parison of the accuracy of the method of double sampling. In order to avoid 
any misunderstanding let us describe all three in detail. 

The method of double sampling consists of the following steps: 

(i) Out of the population $\pi$ we select at random $N$ individuals and ascertain
for them the values of the character $Y$. This sample will be denoted by $S_1$.
The sample $S_1$ is meant to estimate the proportions $p_i$.

(ii) Now we proceed to sample the strata $\pi_i$ and this is the second of the
sampling processes mentioned. Out of each stratum $\pi_i$ we select at random
$m_i$ individuals which form a sample to be denoted by $S_{2,i}$ and ascertain for
each of these individuals the value of the character $X$. The samples $S_{2,i}$ serve
to estimate the mean value of $X$ in each of the strata $\pi_i$. These estimates and
the estimates of the proportions (3) obtained previously from the sample $S_1$
permit us to estimate the grand mean $\bar{X}$.

The combination of (i) and (ii) forms the method of double sampling. Denote
by $m_0$ the sum of the sizes $m_i$ of all the samples $S_{2,i}$, so that

$$m_0 = \sum_{i=1}^{s} m_i; \tag{4}$$

and by $A$ and $B$ the costs of ascertaining for one individual the value of $X$
and that of $Y$ respectively. Finally, let $C$ denote the total amount of money
available for the collection of data. Then the numbers $m_0$ and $N$ must be
subject to the restriction

$$A m_0 + B N = C. \tag{5}$$


We shall consider what values of $m_i$, $m_0$ and $N$, satisfying conditions (4)
and (5), yield the greatest accuracy in estimating the mean value of $X$ by the
method of double sampling. This accuracy will then be compared with that
attainable in the ordinary way, that is, without the application of the method
of double sampling. For this purpose we shall consider a third sampling
process by which all the funds $C$ available are spent on selecting at random a
number, say $M$, of the elements of $\pi$ and in ascertaining for each of them the
value of $X$. Denote this third sample by $S_0$. Its size will have to be $M = C/A$.
In order to get an idea of the utility of the method of double sampling
we shall compare its accuracy with that of the ordinary mean value of $X$
calculated from the sample $S_0$.
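
The restriction (5) and the size $M = C/A$ of the comparison sample lend themselves to a small numerical sketch. The function names below are hypothetical, and the rounding down to whole individuals is an assumption made here; the costs used are those which appear later in Example I.

```python
import math

def unrestricted_sample_size(A, C):
    """Size M = C / A of the single random sample S_0 (whole units only)."""
    return math.floor(C / A)

def first_sample_size(A, B, C, m0):
    """Largest N with A*m0 + B*N <= C, i.e. restriction (5) with any
    unspendable remainder left over."""
    remaining = C - A * m0
    if remaining < 0:
        raise ValueError("m0 alone exceeds the budget C")
    return math.floor(remaining / B)

# Illustrative costs: the figures used later in Example I.
A, B, C = 4, 1, 500
print(unrestricted_sample_size(A, C))   # 125
print(first_sample_size(A, B, C, 89))   # 144
```

Every unit added to $m_0$ removes $A/B$ units from $N$, which is the exchange the later sections seek to balance.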


3. FIRST METHOD OF APPROACH 


In the present paper¹ we shall make no assumption as to the character
of the regression of $X$ on $Y$ in the population $\pi$. Denote by $X_1, X_2, \dots, X_s$
the mean values of $X$ in each of the strata. It follows that the grand mean of
$X$ which is to be estimated is

$$\bar{X} = \sum_{i=1}^{s} p_i X_i. \tag{6}$$


Further denote by $\sigma_i$ the standard deviation of $X$ within the $i$th stratum.
Denote by $n_i$ the number of individuals drawn in the first sample $S_1$ which
fall within the $i$th stratum and introduce

$$r_i = \frac{n_i}{N}. \tag{7}$$

Let $x_{ij}$ denote the value of $X$ of the $j$th individual drawn from the $i$th stratum
to form the sample $S_{2,i}$. Put

$$x_{i\cdot} = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij}. \tag{8}$$


We shall start by considering what function $F_1$ of the observations, namely,
of the numbers (7) and of

$$x_{i1}, x_{i2}, \dots, x_{i m_i} \qquad \text{for } i = 1, 2, \dots, s \tag{9}$$

would be suitable as an estimate of (6). We shall limit our considerations to
homogeneous functions of second order, of the form

$$F_1 = \sum_{i=1}^{s} \sum_{j=1}^{s} \sum_{k=1}^{m_j} \lambda_{ijk}\, r_i\, x_{jk} \tag{10}$$

where $\lambda_{ijk}$ is a constant coefficient. Out of all such functions we shall select
and term the best unbiased estimate of $\bar{X}$, the one which has the following
properties:

(i) The mathematical expectation of $F_1$ is identically equal to $\bar{X}$.

(ii) The variance of $F_1$ is smaller than that of any other function of the
form (10) having the property (i).

¹ The same problem, under the assumption that the regression of $X$ on $Y$ has a certain
known form, forming the second method of approach, will be considered in a later paper.
Denoting by $\mathcal{E}(u)$ the mathematical expectation of any variable $u$, we may
rewrite (i) in the following form:

$$\mathcal{E}(F_1) = \sum_{i=1}^{s} \sum_{j=1}^{s} \sum_{k=1}^{m_j} \lambda_{ijk}\, \mathcal{E}(r_i x_{jk}) = \sum_{i=1}^{s} p_i X_i. \tag{11}$$

When calculating expectations, we shall use the assumption that the population
$\pi$ and all of its strata are so large compared to the sample drawn that the
particular drawings can be considered as mutually independent. We shall
notice further that, in spite of the fact that the samples $S_{2,i}$ probably will be
drawn out of the sample $S_1$ and not directly from the strata $\pi_i$, the variable
$x_{jk}$ is independent of $r_i$. This follows from the circumstance that when we
draw the first sample $S_1$, we do so without any consideration of the values
of $X$. It follows that

$$\mathcal{E}(r_i x_{jk}) = \mathcal{E}(r_i)\, \mathcal{E}(x_{jk}) = p_i X_j. \tag{12}$$


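Substituting (12) in (11) and rearranging, we have

$$\sum_{i=1}^{s} p_i \Bigl( \sum_{j=1}^{s} X_j \sum_{k=1}^{m_j} \lambda_{ijk} - X_i \Bigr) = 0. \tag{13}$$

The necessary and sufficient condition for this equality to hold good identically,
that is to say, whatever the unknown proportions $p_1, p_2, \dots, p_s$ may
be, is that the coefficients of the $p_i$ vanish, i.e.,

$$\sum_{j=1}^{s} X_j \sum_{k=1}^{m_j} \lambda_{ijk} - X_i = 0 \qquad \text{for } i = 1, 2, \dots, s. \tag{14}$$

As we do not know the values of the $X_i$, these equalities should again hold
good identically, that is to say, whatever the values of the $X_i$. The equation
(14) can be rewritten in the form

$$\sum_{j=1}^{i-1} X_j \sum_{k=1}^{m_j} \lambda_{ijk} + X_i \Bigl( \sum_{k=1}^{m_i} \lambda_{iik} - 1 \Bigr) + \sum_{j=i+1}^{s} X_j \sum_{k=1}^{m_j} \lambda_{ijk} = 0 \tag{15}$$

and its identical fulfillment is easily seen to require that

$$\begin{aligned}
\sum_{k=1}^{m_j} \lambda_{ijk} &= 0 \quad \text{for any } j \ne i;\ i, j = 1, 2, \dots, s, \\
\text{and} \qquad \sum_{k=1}^{m_i} \lambda_{iik} &= 1 \quad \text{for } i = 1, 2, \dots, s.
\end{aligned} \tag{16}$$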


Equations (16) express the necessary and sufficient conditions for the function
$F_1$ to be unbiased, considered as an estimate of $\bar{X}$. Obviously, there is an
infinite number of systems of coefficients $\lambda_{ijk}$ satisfying (16) and therefore an
infinity of unbiased estimates of $\bar{X}$ of the form (10). We shall now determine
the one that we agreed to call the “best,” i.e., that which has the smallest
possible variance. Let us assume that the values of the $\lambda_{ijk}$ are fixed somehow
satisfying the conditions (16) and calculate the variance of $F_1$. Denoting
it by $V_1$ we shall have, owing to (6)

$$V_1 = \mathcal{E}(F_1 - \bar{X})^2 = \mathcal{E}\Bigl\{ \sum_{i=1}^{s} (r_i \xi_i - p_i X_i) \Bigr\}^2 \tag{17}$$

where

$$\xi_i = \sum_{j=1}^{s} \sum_{k=1}^{m_j} \lambda_{ijk}\, x_{jk} \tag{18}$$

is again independent of $r_i$. We have further

$$V_1 = \sum_{i=1}^{s} \mathcal{E}\{(r_i \xi_i - p_i X_i)^2\} + 2 \sum_{i=1}^{s-1} \sum_{h=i+1}^{s} \mathcal{E}\{(r_i \xi_i - p_i X_i)(r_h \xi_h - p_h X_h)\}. \tag{19}$$


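But

$$\begin{aligned}
\mathcal{E}(r_i \xi_i - p_i X_i)^2 &= \mathcal{E}\{[(r_i - p_i)\xi_i + p_i(\xi_i - X_i)]^2\} \\
&= \mathcal{E}\{(r_i - p_i)^2\}\, \mathcal{E}(\xi_i^2) + p_i^2\, \mathcal{E}\{(\xi_i - X_i)^2\}
\end{aligned} \tag{20}$$

owing to the independence of $\xi_i$ and $r_i$ and to the fact that $\mathcal{E}(r_i) = p_i$. Now
it is known that

$$\mathcal{E}\{(r_i - p_i)^2\} = \mathcal{E}(r_i^2) - p_i^2 = \frac{p_i q_i}{N} \tag{21}$$

with $q_i = 1 - p_i$. Since²

$$\mathcal{E}\{(\xi_i - X_i)^2\} = \mathcal{E}(\xi_i^2) - X_i^2, \tag{22}$$

to calculate (20) it will be sufficient to calculate $\mathcal{E}\{(\xi_i - X_i)^2\}$ or the variance
of $\xi_i$. Applying the usual formula for the variance of a linear function of
independent variables and remembering that the variance of $x_{jk}$ is denoted
by $\sigma_j^2$, we have

$$\mathcal{E}\{(\xi_i - X_i)^2\} = \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk}^2. \tag{23}$$

² Owing to (16) the expectation of $\xi_i$ is obviously equal to $X_i$.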
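It follows that

$$\mathcal{E}\{(r_i \xi_i - p_i X_i)^2\} = \frac{p_i q_i}{N} \Bigl( \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk}^2 + X_i^2 \Bigr) + p_i^2 \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk}^2. \tag{24}$$

We may now go on and calculate the expectation of the other type of term
in (19). We have

$$\begin{aligned}
\mathcal{E}\{(r_i \xi_i - p_i X_i)(r_h \xi_h - p_h X_h)\} &= \mathcal{E}(r_i r_h \xi_i \xi_h) - p_i p_h X_i X_h \\
&= \mathcal{E}(r_i r_h)\, \mathcal{E}(\xi_i \xi_h) - p_i p_h X_i X_h
\end{aligned} \tag{25}$$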


again owing to the independence of $\xi_i$ and $r_i$. It is known that

$$\mathcal{E}(r_i r_h) = p_i p_h \Bigl( 1 - \frac{1}{N} \Bigr). \tag{26}$$
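
Formulas (21) and (26) are quoted as known. A short derivation may be sketched as follows, assuming, as the text does, that the $N$ drawings of the first sample are mutually independent. The indicator $\epsilon_{it}$, equal to unity when the $t$th individual drawn falls in the $i$th stratum and to zero otherwise, is notation introduced here for illustration only.

$$n_i = \sum_{t=1}^{N} \epsilon_{it}, \qquad \mathcal{E}(\epsilon_{it}) = \mathcal{E}(\epsilon_{it}^2) = p_i, \qquad \epsilon_{it}\, \epsilon_{ht} = 0 \quad (i \ne h),$$

$$\mathcal{E}\{(r_i - p_i)^2\} = \frac{1}{N^2} \sum_{t=1}^{N} \mathcal{E}\{(\epsilon_{it} - p_i)^2\} = \frac{N p_i (1 - p_i)}{N^2} = \frac{p_i q_i}{N},$$

$$\mathcal{E}(r_i r_h) = \frac{1}{N^2} \sum_{t \ne u} \mathcal{E}(\epsilon_{it})\, \mathcal{E}(\epsilon_{hu}) = \frac{N(N-1)}{N^2}\, p_i p_h = p_i p_h \Bigl( 1 - \frac{1}{N} \Bigr).$$

In the last line the terms with $t = u$ drop out because one individual cannot fall in two strata at once.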


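Further

$$\mathcal{E}(\xi_i \xi_h) = \mathcal{E}\Bigl\{ \Bigl( \sum_{j=1}^{s} \sum_{k=1}^{m_j} \lambda_{ijk}\, x_{jk} \Bigr) \Bigl( \sum_{g=1}^{s} \sum_{l=1}^{m_g} \lambda_{hgl}\, x_{gl} \Bigr) \Bigr\}. \tag{27}$$

Remembering that

$$\mathcal{E}(x_{jk}) = X_j \quad \text{and} \quad \mathcal{E}(x_{jk}^2) = \sigma_j^2 + X_j^2 \tag{28}$$

and that the $x_{jk}$ are assumed to be mutually independent, we have

$$\mathcal{E}(\xi_i \xi_h) = \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk} \lambda_{hjk} + \Bigl( \sum_{j=1}^{s} X_j \sum_{k=1}^{m_j} \lambda_{ijk} \Bigr) \Bigl( \sum_{g=1}^{s} X_g \sum_{l=1}^{m_g} \lambda_{hgl} \Bigr). \tag{29}$$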


Until the present moment we have not used the conditions (16) for the
unbiased character of the estimate $F_1$. Therefore the formula for the variance
$V_1$ which we could obtain by substituting (24), (25), (26) and (29) into
(19) would be perfectly general. We shall use it in our second method of
approach. Now, however, we shall simplify (29) by substituting (16). We
have

$$\mathcal{E}(\xi_i \xi_h) = \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk} \lambda_{hjk} + X_i X_h. \tag{30}$$

Now

$$\begin{aligned}
V_1 = {} & \sum_{i=1}^{s} \Bigl( p_i^2 + \frac{p_i q_i}{N} \Bigr) \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk}^2 + \sum_{i=1}^{s} \frac{p_i q_i}{N} X_i^2 \\
& + \frac{2}{N} \sum_{i=1}^{s-1} \sum_{h=i+1}^{s} p_i p_h \Bigl\{ (N - 1) \sum_{j=1}^{s} \sigma_j^2 \sum_{k=1}^{m_j} \lambda_{ijk} \lambda_{hjk} - X_i X_h \Bigr\}.
\end{aligned} \tag{31}$$

Without attempting to simplify this expression at the present stage, let us
select the $\lambda_{ijk}$ so as to minimize (31) while keeping the relations (16) satisfied.
For this purpose we will differentiate with respect to $\lambda_{ijk}$ the expression
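$$V_1 - 2 \sum_{i=1}^{s} \sum_{j=1}^{s} \alpha_{ij} \sum_{k=1}^{m_j} \lambda_{ijk} \tag{32}$$

where the $\alpha_{ij}$ are Lagrange arbitrary multipliers, and equate the derivatives to
zero. After some rearrangement, we get the following equation:

$$\frac{1}{N}\, p_i \sigma_j^2 \Bigl\{ \lambda_{ijk} + (N - 1) \sum_{h=1}^{s} p_h \lambda_{hjk} \Bigr\} = \alpha_{ij}. \tag{33}$$

Summing both sides with respect to $k$ from 1 to $m_j$ and taking into account
(16), we get

$$\begin{aligned}
\frac{N - 1}{N}\, p_i p_j \sigma_j^2 &= m_j \alpha_{ij} \quad \text{for } j \ne i, \\
\frac{1}{N}\, p_j \sigma_j^2 \{ 1 + (N - 1) p_j \} &= m_j \alpha_{jj}.
\end{aligned} \tag{34}$$

Substituting these results in (33), we obtain, for $i \ne j$,

$$\lambda_{ijk} = -(N - 1) \Bigl( \sum_{h=1}^{s} p_h \lambda_{hjk} - \frac{p_j}{m_j} \Bigr) = \lambda_{\cdot jk} \text{ (say)}, \tag{35}$$

$$\lambda_{jjk} = \lambda_{\cdot jk} + \frac{1}{m_j}. \tag{36}$$

Substituting in (35) the values of $\lambda_{ijk}$ thus obtained, we easily get

$$\begin{aligned}
\lambda_{ijk} &= \lambda_{\cdot jk} = 0 \quad \text{for } i \ne j, \\
\lambda_{jjk} &= \frac{1}{m_j}.
\end{aligned} \tag{37}$$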


Substituting these values into (10) we obtain the following expression for the
best unbiased estimate of $\bar{X}$:

$$F_1 = \sum_{i=1}^{s} r_i\, x_{i\cdot}. \tag{38}$$
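
The unbiased character of (38) can be checked by simulation. The following is a minimal sketch under assumptions that are not in the original: $X$ is taken as normally distributed within each stratum, the second-step drawings are made directly from the strata (in line with the independence assumption stated above), and the constants are those of Example I below. The function names and the seed are arbitrary.

```python
import random
import statistics

def double_sampling_estimate(rng, p, means, sigmas, N, m):
    """One realisation of F_1 = sum_i r_i * x_i. as in formula (38).

    Assumption for illustration only: X is normal within each stratum and
    the second-step drawings are independent of the proportions r_i.
    """
    s = len(p)
    # First step: classify N individuals by stratum and form r_i = n_i / N.
    labels = rng.choices(range(s), weights=p, k=N)
    r = [labels.count(i) / N for i in range(s)]
    # Second step: m_i costly observations of X from each stratum.
    stratum_means = [
        statistics.fmean([rng.gauss(means[i], sigmas[i]) for _ in range(m[i])])
        for i in range(s)
    ]
    return sum(r_i * x_i for r_i, x_i in zip(r, stratum_means))

def simulate(replications=4000, seed=1938):
    """Mean and variance of F_1 over repeated double samples."""
    rng = random.Random(seed)
    p, means, sigmas = [0.25, 0.5, 0.25], [1.0, 3.0, 6.0], [1.0, 2.0, 4.0]
    draws = [
        double_sampling_estimate(rng, p, means, sigmas, N=144, m=[10, 39, 40])
        for _ in range(replications)
    ]
    return statistics.fmean(draws), statistics.variance(draws)

mean_f1, var_f1 = simulate()
print(round(mean_f1, 2), round(var_f1, 3))
```

With these constants the grand mean is 3.25, so the simulated mean should fall close to that figure, and the simulated variance gives a rough check on the value of $V_1$ derived next.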


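The formula for the variance, $V_1$, of $F_1$ is obtained by substituting (37) in
(31)

$$V_1 = \sum_{i=1}^{s} \Bigl( p_i^2 + \frac{p_i q_i}{N} \Bigr) \frac{\sigma_i^2}{m_i} + \frac{1}{N} \Bigl\{ \sum_{i=1}^{s} p_i q_i X_i^2 - 2 \sum_{i=1}^{s-1} \sum_{h=i+1}^{s} p_i p_h X_i X_h \Bigr\} \tag{39}$$

which immediately reduces to the following form most convenient for finding
the system of values of $N$ and the $m_i$ that assure the greatest accuracy in
estimating $\bar{X}$:

$$\begin{aligned}
V_1 = {} & \frac{1}{m_0} \Bigl( \sum_{i=1}^{s} \sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}} \Bigr)^2 \\
& + \sum_{i=1}^{s} m_i \Bigl( \frac{\sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}}}{m_i} - \frac{\sum_{j=1}^{s} \sigma_j \sqrt{p_j^2 + p_j q_j N^{-1}}}{m_0} \Bigr)^2 \\
& + \frac{1}{N} \sum_{i=1}^{s} p_i (X_i - \bar{X})^2.
\end{aligned} \tag{40}$$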
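
The passage from (39) to (40) rests on two elementary identities, which may be sketched as follows. The abbreviation $c_i = \sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}}$ is introduced here only to shorten the writing and is not part of the original notation.

$$\sum_{i=1}^{s} p_i q_i X_i^2 - 2 \sum_{i<h} p_i p_h X_i X_h = \sum_{i=1}^{s} p_i X_i^2 - \Bigl( \sum_{i=1}^{s} p_i X_i \Bigr)^2 = \sum_{i=1}^{s} p_i (X_i - \bar{X})^2,$$

since $p_i q_i = p_i - p_i^2$, and

$$\sum_{i=1}^{s} m_i \Bigl( \frac{c_i}{m_i} - \frac{\sum_j c_j}{m_0} \Bigr)^2 = \sum_{i=1}^{s} \frac{c_i^2}{m_i} - \frac{2 \bigl( \sum_j c_j \bigr)^2}{m_0} + \frac{\bigl( \sum_j c_j \bigr)^2}{m_0} = \sum_{i=1}^{s} \frac{c_i^2}{m_i} - \frac{1}{m_0} \Bigl( \sum_{j=1}^{s} c_j \Bigr)^2,$$

so that the first term of (39), which is $\sum_i c_i^2 / m_i$, equals the sum of the first two terms of (40).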


It is seen that none of the three terms in the right hand side can be negative.
There is only one term which depends directly on $m_1, m_2, \dots, m_s$, namely, the
second, the others being dependent on $m_0 = \sum_{i=1}^{s} m_i$ and on $N$. It follows that
once $N$ and $m_0$ are fixed in one way or another the value of $V_1$ depends on the
$m_i$ and the value they ascribe to the second term. It is easily seen that its
minimum value is zero and that this is attained whenever for each value of
$i = 1, 2, \dots, s$

$$m_i = m_0\, \frac{\sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}}}{\sum_{j=1}^{s} \sigma_j \sqrt{p_j^2 + p_j q_j N^{-1}}}. \tag{41}$$

Owing to the fact that the $m_i$ are integers, this ideal seldom can be attained
exactly, but it may be approached as far as possible. We shall further assume
that the $m_i$ are selected in closest agreement with (41) and that the second
term in (40) is negligible compared with the remaining two.
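
The allocation (41) is easily computed. The sketch below uses a hypothetical function name and simply rounds each $m_i$ to the nearest integer, which is an assumption made here; the rounded sizes need not add up to $m_0$ in every case. The figures passed in are those of Example I below.

```python
import math

def allocate_second_sample(p, sigma, N, m0):
    """Split m0 among the strata in proportion to
    sigma_i * sqrt(p_i**2 + p_i*q_i/N), as in formula (41).

    Rounding to the nearest integer is a simplification made here; the
    rounded sizes need not add up to m0 exactly.
    """
    weights = [s_i * math.sqrt(p_i**2 + p_i * (1 - p_i) / N)
               for p_i, s_i in zip(p, sigma)]
    total = sum(weights)
    return [round(m0 * w / total) for w in weights]

# Constants of Example I with N = 144 and m0 = 89.
print(allocate_second_sample([0.25, 0.5, 0.25], [1, 2, 4], N=144, m0=89))
# [10, 39, 40]
```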
We must now consider what values of $m_0$ and $N$ satisfying (5) are likely to
give the smallest value to the sum of only two terms in (40), say

$$V_1' = \frac{1}{m_0} \Bigl( \sum_{i=1}^{s} \sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}} \Bigr)^2 + \frac{1}{N} \sum_{i=1}^{s} p_i (X_i - \bar{X})^2. \tag{42}$$


Owing to the complex structure of the first of these terms, an accurate solution 
of the problem is difficult to attain. However, it is easy to get an approxi- 
mate solution which will probably in most cases be sufficient. 

In most cases, whenever we do not make any special assumption concerning
the character of the regression of $X$ on $Y$, we shall probably classify the
population $\pi$ into only a few strata whence it may be assumed that the proportions
$p_i$ will not be very small and consequently $p_i q_i N^{-1}$ will be considerably
smaller than any of the $p_i^2$. If so, then the value of the square root

$$\sqrt{p_i^2 + p_i q_i N^{-1}} \tag{43}$$

will be very much the same as that of $p_i$. For example, if $p_i = .1$, $q_i = .9$
and $N = 100$, it is .1044 and if the value of $N p_i$ were somewhat larger, the
agreement would be still better. Therefore, instead of trying to minimize
(42) we may usefully start by trying to minimize, say

$$V_1'' = \frac{1}{m_0} \Bigl( \sum_{i=1}^{s} p_i \sigma_i \Bigr)^2 + \frac{1}{N} \sum_{i=1}^{s} p_i (X_i - \bar{X})^2,$$

or

$$V_1'' = \frac{a^2}{m_0} + \frac{b^2}{N} \tag{44}$$

for short. Denote by $\nu_1$ and $\nu_2$ the smallest numbers of selections into the
first and the second sample respectively, the total cost of which is the same,
so that

$$\nu_1 B = \nu_2 A. \tag{45}$$


If $m_0'$ and $N'$ are the integer numbers minimizing (44) and satisfying (5),
then any change of these values by taking instead of them either

$$m_0' - \nu_2 \text{ and } N' + \nu_1 \quad \text{or} \quad m_0' + \nu_2 \text{ and } N' - \nu_1, \tag{46}$$

will increase the value of (44). This means that $m_0'$ and $N'$ satisfy the inequalities

$$\frac{a^2}{m_0' + \nu_2} + \frac{b^2}{N' - \nu_1} \ge \frac{a^2}{m_0'} + \frac{b^2}{N'} \le \frac{a^2}{m_0' - \nu_2} + \frac{b^2}{N' + \nu_1}. \tag{47}$$

These inequalities reduce easily to the following ones:

$$\frac{1 - \nu_2 / m_0'}{1 + \nu_1 / N'} \le \frac{(N')^2\, a^2 \nu_2}{(m_0')^2\, b^2 \nu_1} \le \frac{1 + \nu_2 / m_0'}{1 - \nu_1 / N'} \tag{48}$$

showing that in order to minimize (44) while keeping (5) fixed, we have to
select $m_0$ and $N$ as nearly as possible proportionately to $a\sqrt{\nu_2}$ and $b\sqrt{\nu_1}$
respectively. Putting for a moment


$$m_0 = N\, \frac{a \sqrt{\nu_2}}{b \sqrt{\nu_1}} \tag{49}$$

and substituting it in (5), we get

$$N = \frac{C b \sqrt{\nu_1}}{A a \sqrt{\nu_2} + B b \sqrt{\nu_1}} \tag{50}$$

which gives

$$m_0 = \frac{C a \sqrt{\nu_2}}{A a \sqrt{\nu_2} + B b \sqrt{\nu_1}}. \tag{51}$$


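Using (45) and eliminating $\nu_1$ and $\nu_2$ we may rewrite (50) and (51) in the final
form

$$N = \frac{C b}{B b + a \sqrt{AB}} \tag{52}$$

$$m_0 = \frac{C a}{A a + b \sqrt{AB}} \tag{53}$$

where

$$a = \sum_{i=1}^{s} p_i \sigma_i \tag{54}$$

and

$$b^2 = \sum_{i=1}^{s} p_i (X_i - \bar{X})^2. \tag{55}$$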


Here we must remember the following circumstances:

(1) that both $m_0$ and $N$ are integers and therefore formulae (52) and (53)
should be calculated to the nearest integer;

(2) that a change in $m_0$ by one unit must be compensated by a change in
$N$ by several units;

(3) that the solutions which would be obtained by taking exact values of
(52) and (53) would minimize the value of (44) with $a$ as given in (54), whereas
the value of the variance in (42) depends on

$$a_1 = \sum_{i=1}^{s} \sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}} \tag{56}$$

instead of $a$.

It follows that the integers nearest to (52) and (53) may not necessarily
minimize (42), but since the difference between $a$ and $a_1$ is slight, they may be
considered as the first approximations. Frequently these first approximations
will also be the accurate values.
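
The first approximations (52) and (53) can be computed as in the following sketch. The function name and argument layout are hypothetical; the stratum constants are assumed known in advance, and the figures are those of Example I below.

```python
import math

def first_approximation(p, sigma, stratum_means, A, B, C):
    """Real-valued N and m0 from formulas (52) and (53).

    stratum_means holds the X_i; in practice these, like p and sigma,
    would have to come from prior knowledge or a preliminary inquiry.
    """
    a = sum(p_i * s_i for p_i, s_i in zip(p, sigma))                    # (54)
    grand_mean = sum(p_i * x_i for p_i, x_i in zip(p, stratum_means))   # (6)
    b = math.sqrt(sum(p_i * (x_i - grand_mean) ** 2
                      for p_i, x_i in zip(p, stratum_means)))           # (55)
    root_ab = math.sqrt(A * B)
    N = C * b / (B * b + a * root_ab)      # (52)
    m0 = C * a / (A * a + b * root_ab)     # (53)
    return N, m0

N, m0 = first_approximation([0.25, 0.5, 0.25], [1, 2, 4], [1, 3, 6],
                            A=4, B=1, C=500)
print(round(N), round(m0))   # 142 89
```

Before rounding, the two values spend the budget exactly, that is, $A m_0 + B N = C$; after rounding $m_0$, any money left over may be spent on a few more units of the first sample.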

In order to find the second approximation, we may calculate $a_1$ as in (56)
substituting $N$ as calculated from (52) and then substitute the value obtained
into (53) to get a new value of $m_0$. This sometimes will indicate the necessity
of increasing the original $m_0$ by unity. However, owing to the fact that both
$m_0$ and $N$ must be integers, the real check of what values do give the minimum
is obtained simply by substituting into (42) both the first approximations to
$m_0$ and $N$ and a few neighboring systems of values, e.g., $m_0 - 1$ and $m_0 + 1$
and the corresponding values of $N$.
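
The check by neighboring values may be sketched as below. The function names and the search width are hypothetical, and the rule of giving $N$ everything that remains of the budget once $m_0$ is fixed is an assumption made here.

```python
import math

def v1_prime(p, sigma, stratum_means, N, m0):
    """Formula (42): V_1 with the m_i allocated exactly as in (41)."""
    a1 = sum(s_i * math.sqrt(p_i**2 + p_i * (1 - p_i) / N)
             for p_i, s_i in zip(p, sigma))                        # (56)
    grand_mean = sum(p_i * x_i for p_i, x_i in zip(p, stratum_means))
    b2 = sum(p_i * (x_i - grand_mean) ** 2
             for p_i, x_i in zip(p, stratum_means))                # (55)
    return a1**2 / m0 + b2 / N

def best_neighbour(p, sigma, stratum_means, A, B, C, m0_start, width=1):
    """Try m0_start - width .. m0_start + width, spending the rest on N."""
    candidates = []
    for m0 in range(m0_start - width, m0_start + width + 1):
        N = math.floor((C - A * m0) / B)   # largest N allowed by (5)
        candidates.append((v1_prime(p, sigma, stratum_means, N, m0), m0, N))
    return min(candidates)

value, m0, N = best_neighbour([0.25, 0.5, 0.25], [1, 2, 4], [1, 3, 6],
                              A=4, B=1, C=500, m0_start=89)
print(m0, N)   # 89 144
```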






















4. EXAMPLE I 


It may be useful to illustrate the above theory by some simple examples.
Assume that there are only three strata, so that $s = 3$. Assume further the
following values of the constants involved:

$$\begin{aligned}
p_1 &= \tfrac{1}{4}, & p_2 &= \tfrac{2}{4}, & p_3 &= \tfrac{1}{4}, \\
X_1 &= 1, & X_2 &= 3, & X_3 &= 6, \\
\sigma_1 &= 1, & \sigma_2 &= 2, & \sigma_3 &= 4, \\
A &= 4, & B &= 1, & C &= 500.
\end{aligned} \tag{57}$$

In order to calculate the values of $m_0$ and $N$, we calculate

$$a = 2.25, \tag{58}$$

$$\bar{X} = 3.25, \tag{59}$$

$$b^2 = 3.1875 = (1.7854)^2. \tag{60}$$

It follows that

$$N = 142 \text{ (to the nearest integer)} \tag{61}$$

and accordingly

$$m_0 = 89. \tag{62}$$

It will be seen that the necessity of taking $m_0$ to the nearest integer permits
an increase in the value of $N$ to 144, without exceeding the limit of expense,
500 units. Let us now see how $m_0 = 89$ should be distributed between the
three strata. Easy calculations give

$$\begin{aligned}
\sigma_1 \sqrt{p_1^2 + p_1 q_1 N^{-1}} &= .2526 \\
\sigma_2 \sqrt{p_2^2 + p_2 q_2 N^{-1}} &= 1.0035 \\
\sigma_3 \sqrt{p_3^2 + p_3 q_3 N^{-1}} &= 1.0104 \\
\sum_{i=1}^{3} \sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}} &= 2.2664.
\end{aligned} \tag{63}$$

Hence, using (41) and taking the nearest integers, we get

$$m_1 = 10, \quad m_2 = 39, \quad m_3 = 40. \tag{64}$$


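With this system of the $m_i$ the middle term of formula (40) would have the
value

$$\sum_{i=1}^{3} m_i \Bigl( \frac{\sigma_i \sqrt{p_i^2 + p_i q_i N^{-1}}}{m_i} - \frac{\sum_{j=1}^{3} \sigma_j \sqrt{p_j^2 + p_j q_j N^{-1}}}{m_0} \Bigr)^2 = .0000048564. \tag{65}$$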
The total value of $V_1$ in (40) is found to be

$$V_1 = .079855 \tag{66}$$

and it follows that by using (64) the value of the middle term is for all practical
purposes negligible. It is interesting to compare this value with the one
which could be obtained without adjusting the numbers $m_i$ to the variability
and the size of the strata, i.e., without using (41). Putting arbitrarily $m_1 = 29$,
$m_2 = m_3 = 30$, we get

$$V_1 = .091927. \tag{67}$$


Comparing this with (66) we see that neglecting to adjust the $m_i$ according
to formula (41) results, in this particular example, in an increase of the variance
by over 15 percent, which is a considerable and unnecessary loss in
accuracy.
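
The comparison of (66) with (67) may be reproduced by the following sketch, which evaluates $V_1$ for any given set of the $m_i$ as a within-stratum part plus a between-stratum part. The function name is hypothetical.

```python
def variance_double_sampling(p, sigma, stratum_means, N, m):
    """V_1 of the estimate (38) for given N and stratum sizes m_i,
    written as the sum of the within-stratum and between-stratum parts."""
    grand_mean = sum(p_i * x_i for p_i, x_i in zip(p, stratum_means))
    within = sum((p_i**2 + p_i * (1 - p_i) / N) * s_i**2 / m_i
                 for p_i, s_i, m_i in zip(p, sigma, m))
    between = sum(p_i * (x_i - grand_mean) ** 2
                  for p_i, x_i in zip(p, stratum_means)) / N
    return within + between

p, sigma, means = [0.25, 0.5, 0.25], [1, 2, 4], [1, 3, 6]
adjusted = variance_double_sampling(p, sigma, means, N=144, m=[10, 39, 40])
arbitrary = variance_double_sampling(p, sigma, means, N=144, m=[29, 30, 30])
print(f"{adjusted:.4f} {arbitrary:.4f} ratio {arbitrary / adjusted:.2f}")
# 0.0799 0.0919 ratio 1.15
```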

This is the situation if we use for $m_0$ and $N$ the values found as first approximations.
Substituting 144 for $N$ in (56) and calculating $a_1$ and then using
this value instead of $a$ to calculate the second approximation of $m_0$, we get

$$m_0 = 89.6783 \tag{68}$$

which suggests that the best integer values of $m_0$ and $N$ are $m_0 = 90$ and
$N = 140$. However, using them we obtain

$$V_1 = .079866. \tag{69}$$

Again using $m_0 = 88$ and $N = 148$ we get

$$V_1 = .079888 \tag{70}$$

and it appears that the first approximation gives in fact the best possible
result, but the actual difference is negligible.

We must now see whether this result, the best that could be obtained by 
the method of double sampling is actually better than what could be obtained 
by spending all the money available to collect as much data on X as possible, 
i.e. by drawing the unrestricted random sample $S_0$ (see p. 130).

The best linear estimate of $\bar{X}$ calculated from the sample $S_0$ would be the
sample mean $\bar{x}$. Its variance, $V_0$, is known to be connected with the symbols
of this paper by means of the formula

$$V_0 = \frac{1}{M} \Bigl\{ \sum_{i=1}^{s} p_i \sigma_i^2 + \sum_{i=1}^{s} p_i (X_i - \bar{X})^2 \Bigr\}. \tag{71}$$

It is easy to find that in our example

$$M = \frac{C}{A} = 125 \tag{72}$$

and

$$V_0 = .0755. \tag{73}$$
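
Formula (71) may be evaluated as in the sketch below; the function name is hypothetical and $M = C/A$ is used without rounding.

```python
def variance_unrestricted(p, sigma, stratum_means, A, C):
    """V_0 of formula (71) for a single random sample of size M = C / A."""
    M = C / A
    grand_mean = sum(p_i * x_i for p_i, x_i in zip(p, stratum_means))
    within = sum(p_i * s_i**2 for p_i, s_i in zip(p, sigma))
    between = sum(p_i * (x_i - grand_mean) ** 2
                  for p_i, x_i in zip(p, stratum_means))
    return (within + between) / M

v0 = variance_unrestricted([0.25, 0.5, 0.25], [1, 2, 4], [1, 3, 6], A=4, C=500)
print(round(v0, 4))   # 0.0755
```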
It follows that in this particular case the method of double sampling, even
supplemented by the optimum adjustment of the numbers of sampling, is
equivalent to a certain loss of accuracy of the final result. Taking the ratio
of the variances (73) and (66)

$$\frac{V_1}{V_0} = 1.058 \tag{74}$$


we see that this loss of accuracy amounts to nearly 6 percent. This unfavor- 
able result is, of course, due to the fact that the differentiation between the 
strata with respect to the values of X is small compared with the variability 
of the strata themselves and to the fact that the difference in the cost of 
obtaining data on X and Y is comparatively small. To illustrate this point 
let us consider the following examples. 


5. EXAMPLE II 


Assume that the values of the $p_i$, $X_i$ and $\sigma_i$ are exactly as in Example I
and put

\[ A = 40, \quad B = 1, \quad C = 5000 \tag{75} \]

so that the process of obtaining data on Y is now 40 times cheaper than that
on X, while the ratio C/A is the same as formerly. It follows that $V_0$ in
this case will be exactly the same as formerly (73), but the minimum value of
$V_1$ will change. We shall have

\[ m_0 = 111, \quad N = 560 \tag{76} \]

and, assuming that the $m_i$ are fixed according to (41), we get finally

\[ V_1 = .05147 \tag{77} \]

and it is seen that this value is exceeded by $V_0$ by more than 46 percent!


6. EXAMPLE III 


Here we shall keep the values of the $p_i$, the $\sigma_i$, and those of A, B and C
as in Example I but change the values of the $X_i$ so as to increase the value of
b, namely, put

\[ X_1 = 1, \quad X_2 = 6, \quad X_3 = 11. \tag{78} \]

Then

\[ b^2 = 12.5 = (3.53553)^2 \tag{79} \]

and

\[ V_0 = .1500. \tag{80} \]

On the other hand, applying the method of double sampling and taking the
optimum system of numbers of samplings, viz.,

\[ m_1 = 8, \quad m_2 = 31, \quad m_3 = 31; \quad m_0 = 70, \quad N = 220 \tag{81} \]

we get

\[ V_1 = .1298, \tag{82} \]

a gain in accuracy in comparison with (80) of about 15 percent.
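
The three examples can be put on a common footing by forming the ratio of the two variances in each case. The short Python sketch below only repeats this arithmetic with the variances as quoted in (69), (73), (77), (80) and (82), the last one read here as .1298; the dictionary layout and the function name are illustrative and not part of the original notation.

```python
# Variances quoted in the text.
# V1 = double sampling, V0 = unrestricted sampling of X alone.
examples = {
    "I":   {"V1": 0.079866, "V0": 0.0755},  # (69) and (73)
    "II":  {"V1": 0.05147,  "V0": 0.0755},  # (77) and (73)
    "III": {"V1": 0.1298,   "V0": 0.1500},  # (82) as read here, and (80)
}


def variance_ratio(v1, v0):
    """Return V1/V0; a value above 1 means double sampling is less accurate."""
    return v1 / v0


ratios = {name: variance_ratio(v["V1"], v["V0"]) for name, v in examples.items()}

for name, r in ratios.items():
    verdict = "double sampling loses" if r > 1 else "double sampling gains"
    print(f"Example {name}: V1/V0 = {r:.4f}, V0/V1 = {1 / r:.4f} ({verdict})")
```

Only in Example I does the ratio exceed unity, which is the numerical content of conclusions (i) and (ii) below.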


7. CONCLUSIONS 


(i) The examples II and III show that under favorable conditions the 
method of double sampling is a very powerful tool of statistical research. 

(ii) However, the advantages of methods are but rarely universal and in 
certain cases, as for instance in the above example I, the direct unrestricted 
sampling may be more efficient than the method of double sampling. 

(iii) Without a certain previous knowledge of the properties of the popula- 
tion sampled it is impossible to say which of the two methods will be more 
efficient. 

(iv) It is also impossible to tell in advance what the values of $N$, $m_0$, and
of the $m_i$ should be to assure the greatest accuracy of the double sampling
method.

(v) On the other hand, if certain properties of the sampled population
are known, or can be estimated, then it is possible to estimate the values of
$m_0$ and $N$ and also those of the $m_i$ by which the method of double sampling
gives the greatest possible accuracy. The properties of population $\pi$ needed
for this purpose are the values of the $p_i$, $\sigma_i$ and $X_i$. They could be estimated
by means of a preliminary inquiry on the lines suggested by me during the
conference at the U. S. Department of Agriculture Graduate School and also
in my previous publications³ on sampling human populations. Once approximate
values of the $p_i$, $\sigma_i$ and $X_i$ are obtained, they should be substituted into
formulae (52), (53) and (41) to obtain the approximations of the optimum
values of $m_0$, $N$ and the $m_i$.

(vi) Before deciding whether to apply the method of double sampling, we 
should see that the prospects are that it will give better results than the direct 
unrestricted sampling of values of X. 

For this purpose the approximate values of the $p_i$, $\sigma_i$ and $X_i$ should be
substituted into (40) and (71) to obtain the approximate values of the variances
$V_1$ and $V_0$. The decision to apply the method of double sampling should be
taken only if the approximate value of $V_1$ proves to be considerably smaller
than that of $V_0$.


3 J. Neyman: “An Outline of the Theory and Practice of Representative Method Applied
in Social Research.” Institute for Social Problems, Warsaw, 1933. Polish with an English
Summary.

J. Neyman: “On the Two Different Aspects of the Representative Method.” J.R.S.S.
1934, pp. 558-625.

See also P. V. Sukhatme: “Contribution to the Theory of the Representative Method.”
Supplement to the J.R.S.S., Vol. II, 1935, pp. 253-268.

(vii) The steps described in (v) and (vi) are possible only if some previous
knowledge of the population $\pi$ is available. This may be obtained from
various sources: from some previous experience concerning the population
$\pi$, or from a specially arranged preliminary inquiry. Such a preliminary
inquiry consists of drawing from $\pi$ a relatively small unrestricted random
sample of individuals and in ascertaining for all of them the values of both
characters under consideration X and Y. The data thus obtained should be
used to estimate the $p_i$, the $\sigma_i$ and the $X_i$.

In order to exemplify the kind of previous experience which may be used
to plan future inquiries on the lines indicated in (v) and (vi), I may mention
a recent extensive Study of Consumer Purchases, a Federal Works Project
administered by the Bureau of Labor Statistics, U. S. Department of
Labor and the Bureau of Home Economics, U. S. Department of Agriculture,
in cooperation with the National Resources Committee and the Central
Statistical Board. This inquiry was carried out by the method of double
sampling and therefore, in the process of working out the data, both the
proportions $p_i$ and the means $X_i$ corresponding to particular strata and to many a
character X must have been estimated. Probably the values of $\sigma_i$ are also
available. These figures could be used as pointed out in (v) and (vi) when
planning any new inquiry concerning the same characters and the same or
some similar population.


Part 3. On a Most Powerful Method of Discovering 
Statistical Regularities 


(This section is based on a talk given before the members of Sigma Xi at a meeting 
of the Society held in Berkeley, California, April 9, 1947.) 


You must have heard the often repeated joke that there are three kinds 
of lies: the polite lie, the malicious lie and statistics. The subject of my 
talk tonight will be the kind of statistics that is frequently a lie although, 
undoubtedly, the authors compiling such statistics do not mean any sort 
of mischief. For the most part they are well meaning but ignorant of 
the theory of statistics and they are the victims of their own lack of pro- 
fessional education. 

There are many ways of handling perfectly correct data which at first
sight seem intuitively sound but which tend to introduce into the data
extraneous regularities. These regularities, artificially introduced into the
observational material, suggest connections between the various factors
which in fact do not exist. My purpose tonight is to describe one example
of such an analysis of statistical data. If you look through the volumes
of a statistical-economical or a statistical-sociological journal, you are very
likely to find examples of practical applications of this method.


4 Jour. Am. Stat. Assoc., Vol. XXXI, 1936, p. 1385, and Vol. XXXII, 1937, p. 311.

The method in question is so powerful that by means of it one can suc- 
cessfully prove that storks bring babies. Once upon a time an inquisitive 
friend of mine decided to study this question empirically and thereupon 
he collected some relevant data. The data are quite comprehensive and 
refer to 54 different counties. The raw data which he collected are repro- 
duced in Table I. 

The data include the number W of women of child-bearing age (second 
column of Table I, given in units of 10,000), the number S of storks in 
each county (third column) and, finally, the number B of babies born 
during a specified period of time (fourth column). In the beginning, my 
friend had in mind a direct comparison of the numbers S and B. However, 
it was pointed out to him that such a comparison is not convincing because 
the counties vary in size and larger counties may be expected to have 
more women, more babies and also more storks. Thus the variation in the 
size of the county appeared as a disturbing factor hiding the true relation- 
ship between the two quantities S and B. 

In order to eliminate the disturbing influence of the size of the county, 
my friend hit upon the brilliant idea of comparing, not the actual numbers 
of births and the actual numbers of storks, but the birth rates on the one 
hand and the “densities of storks” per 10,000 women on the other. 

Thus he obtained the quantities X and Y as follows,

\[ X = \frac{S}{W} \quad \text{and} \quad Y = \frac{B}{W}, \]


and then he tried to compare the two quotients X and Y. Naturally, in 
questions of this kind you cannot expect an absolute regularity. In par- 
ticular, you cannot possibly expect that every increase in the quotient X 
will always be accompanied by a proportional increase in Y. There must 
be fluctuations and so you will expect to find counties with a large density 
of storks and a small birth rate and vice versa. The best you can hope 
for in the way of regularity is that, if you classify all 54 counties according 
to the density of storks and average the corresponding birth rates, then 
the averages will show a variation parallel to the variation in the density 
of storks. To put it professionally: it is inconceivable that the birth rate 
is a monotonic function of the density of storks and the believer in the 
proficiency of these birds must be satisfied if he finds a positive correlation. 

TABLE I

Do storks bring babies?—Raw data

County   Women in                     County   Women in
no.      10,000s   Storks   Babies    no.      10,000s   Storks   Babies
  1         1         2       10        28        4         6       25
  2         1         2       15        29        4         6       30
  3         1         2       20        30        4         6       35
  4         1         3       10        31        4         7       25
  5         1         3       15        32        4         7       30
  6         1         3       20        33        4         7       35
  7         1         4       10        34        4         8       25
  8         1         4       15        35        4         8       30
  9         1         4       20        36        4         8       35
 10         2         4       15        37        5         7       30
 11         2         4       20        38        5         7       35
 12         2         4       25        39        5         7       40
 13         2         5       15        40        5         8       30
 14         2         5       20        41        5         8       35
 15         2         5       25        42        5         8       40
 16         2         6       15        43        5         9       30
 17         2         6       20        44        5         9       35
 18         2         6       25        45        5         9       40
 19         3         5       20        46        6         8       35
 20         3         5       25        47        6         8       40
 21         3         5       30        48        6         8       45
 22         3         6       20        49        6         9       35
 23         3         6       25        50        6         9       40
 24         3         6       30        51        6         9       45
 25         3         7       20        52        6        10       35
 26         3         7       25        53        6        10       40
 27         3         7       30        54        6        10       45


TABLE II

Do storks bring babies?—Analytical presentation

Density of       Number     Average
storks per       of         birth      Class
10,000 women     counties   rate       average
   1.33             3         6.67       7.12
   1.40             3         7.00
   1.50             6         7.08
   1.60             3         7.00
   1.67             6         7.50
   1.75             3         7.50       9.22
   1.80             3         7.00
   2.00            12        10.21
   2.33             3         8.33      11.67
   2.50             3        10.00
   3.00             6        12.50
   4.00             3        15.00


This was the attitude of my friend and he compiled Table II. I have
checked the figures in Table II and so have several other people. We found
no mistakes in arithmetic. Furthermore, you will have no difficulty in
checking the table yourself. Among the 54 counties studied, there were
three in which there were on the average 1.33 storks per 10,000 women
of child-bearing age. The average birth rate in these counties was 6.67.
Also, there were three counties with 1.40 as the density of storks and the
average birth rate for these was 7.00, and so forth down the column. An
inspection of Table II will show that the birth rate, although subject to
fluctuations, steadily increases with an increase in the density of storks.
This increase becomes even more marked if we divide all the counties into
three classes according to the density of storks: densities below 1.7, densities
between 1.7 and 2.1, densities above 2.1. The corresponding class averages
are given in the last column of the table and show a decisive increase.
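
The reader who wishes to check Table II mechanically may find the following Python sketch convenient. It regenerates the 54 counties from the regular pattern of Table I (for each number of women there are three stork counts, each combined with three baby counts), groups the birth rates by stork density, and forms the three class averages. The names `STORKS`, `counties`, `table_ii` and `class_average` are introduced here for illustration only.

```python
from collections import defaultdict

# Table I, rebuilt from its pattern: W is the number of women in 10,000s;
# for each W there are three stork counts, each paired with three baby counts.
STORKS = {1: (2, 3, 4), 2: (4, 5, 6), 3: (5, 6, 7),
          4: (6, 7, 8), 5: (7, 8, 9), 6: (8, 9, 10)}

counties = [(w, s, b)
            for w, storks in STORKS.items()
            for s in storks
            for b in (5 * w + 5, 5 * w + 10, 5 * w + 15)]

# Group the birth rates Y = B/W by the stork density X = S/W,
# rounded to two decimals as in Table II.
by_density = defaultdict(list)
for w, s, b in counties:
    by_density[round(s / w, 2)].append(b / w)

table_ii = {x: (len(rates), sum(rates) / len(rates))
            for x, rates in sorted(by_density.items())}


def class_average(lo, hi):
    """Average birth rate over all counties with lo <= density < hi."""
    rates = [b / w for w, s, b in counties if lo <= s / w < hi]
    return sum(rates) / len(rates)


class_averages = [class_average(0.0, 1.7),
                  class_average(1.7, 2.1),
                  class_average(2.1, float("inf"))]

for x, (n, avg) in table_ii.items():
    print(f"{x:5.2f} {n:3d} {avg:6.2f}")
print([round(c, 2) for c in class_averages])
```

The class averages come out as 7.12, 9.22 and 11.67, so the arithmetic of Table II is indeed beyond reproach; the fault, as shown below, lies elsewhere.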

My friend’s conclusion was that, although there is no evidence of storks 
actually bringing babies, there is overwhelming evidence that, by some 
mysterious process, they influence the birth rate! I know that some of 
you are skeptical and suspect that the original data of Table I were inten- 
tionally falsified to produce the astounding result exhibited in Table II. 
Let me assure you that these suspicions are unfounded. If anything, my 
friend was extremely lucky in collecting the data. Further, he was certainly 
very careful in classifying them in Table I so that it is extremely easy to 
make a complete analysis without performing any arithmetic. 

You will notice that all the 54 counties fall into six different groups. 
It happens that the nine counties forming a group have the same number 
of women, 10,000 in the first group, 20,000 in the second, etc. Proceeding 
further, we notice that each group of nine counties falls into three
subgroups of three counties each. The subgroups are ordered according to
the number of storks. Thus, the first group of counties contains three 
with 2 storks, three with 3 storks and three with 4 storks. The same 
kind of thing is repeated in all other groups of counties. The counties of 
the second group must have a larger area than those of the first. They 
have more women and the number of storks in them varies from 4 to 6, 
etc. Turning to the columns giving the total number of babies born, we 
notice constant fluctuations. Thus, within the first group, in the three 
counties with the same number 2 of storks, the numbers of babies born 
are 10, 15 and 20. In the next subgroup of three counties there were 
3 storks each and, impressed by Table II, we might expect that the num- 
bers of babies, though fluctuating, will show an increase compared with 
the first subgroup. However, Table I does not display an increase in the 
number of babies born corresponding to an increase in storks, as long as 
the number of women remains constant. This is true in the first group 
of nine counties and it is also true in any other group. So long as we 
consider a group of counties with the same number of women, an increase 
in the number of storks does not have any effect whatsoever on the number 
of babies born. We express this technically by saying that the conditional 
distribution of the number of babies born, given the number of women, is 
independent of the number of storks. Also, we may say that, given the 
number of women, the birth rate is independent of the number of storks. 
This finding appears to be contrary to the intuition of my friend who was 
much aggrieved, but it coincides with your intuition and my own. Thus, 
apart from a rather unusual regularity, the figures in Table I do not involve 
anything unexpected. 

How then can one explain the most unexpected features of Table II? 
Once you start to think about it, the explanation is very easy. The 
phenomenon was first noticed by Karl Pearson some fifty years ago and 
was called “spurious correlation.” 

The variable X, representing the density of storks, is a function of two
variables S and W, say

\[ X = f_1(S, W) = \frac{S}{W}. \]

Similarly, the birth rate Y is a function of B and W, say

\[ Y = f_2(B, W) = \frac{B}{W}. \]

It happens in the present case that the two functions $f_1$ and $f_2$ coincide
since they are both quotients with W in the denominator. However, the
coincidence of the two functions $f_1$ and $f_2$ is not essential. The essential
point is that these two functions depend on a common argument W and
the fluctuations of W must create simultaneous effects upon the values of
X and Y. Since W appears in the denominator of both fractions, any
“abnormal” increase in W tends to diminish both X and Y simultaneously.
Also, any “abnormal” decrease in W tends to increase both X and Y. As
a result, X and Y are positively correlated.

In another case we may be considering two functions, say

\[ X = \frac{S}{W} \quad \text{and} \quad Z = BW, \]

where the letters S, B and W stand for some other observable variables.
You will easily guess that in this case the presence of W in both X and Z
will tend to create a negative correlation between X and Z. This correlation
has nothing to do with social or economic factors governing the
variation of the three variables but is simply the result of our own arithmetic
operations. These are, of course, only intuitive considerations and
the exact conclusions require some algebra.
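
The intuitive argument can be tried out on artificial numbers. The Python sketch below is a small simulation under assumptions of its own choosing: S, B and W are drawn independently of one another (normal for S and B, uniform for W, with arbitrary parameters), which is a simpler situation than that of the stork data. The helper `pearson` and all numerical settings are illustrative and do not come from the text.

```python
import math
import random


def pearson(u, v):
    """Ordinary product-moment correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


rng = random.Random(1947)  # fixed seed so that the run is repeatable
n = 20000

# Three mutually independent variables (an assumption of this sketch).
S = [rng.gauss(10, 1) for _ in range(n)]
B = [rng.gauss(20, 2) for _ in range(n)]
W = [rng.uniform(1, 3) for _ in range(n)]

X = [s / w for s, w in zip(S, W)]  # quotient with W in the denominator
Y = [b / w for b, w in zip(B, W)]  # second quotient with the same denominator
Z = [b * w for b, w in zip(B, W)]  # product with W as a factor

r_sb = pearson(S, B)  # close to zero: S and B are unrelated
r_xy = pearson(X, Y)  # markedly positive
r_xz = pearson(X, Z)  # markedly negative

print(f"r(S, B) = {r_sb:+.3f}")
print(f"r(X, Y) = {r_xy:+.3f}")
print(f"r(X, Z) = {r_xz:+.3f}")
```

Although S and B are unrelated by construction, the two quotients come out strongly positively correlated and the quotient and the product strongly negatively correlated, solely because of the common argument W.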

You may be amused by computing the correlation coefficient R between 
the variables X and Y. This is quite easy if we make certain simplifying 
assumptions. 

We shall assume that

(i) Given W, the variables S and B are independent;

(ii) S and W are correlated and the regression of S on W is linear, say

\[ E(S \mid W) = A_0 + A_1 W. \]

Moreover, we shall assume that the conditional variance of S given
W, say $\sigma^2_{S|W}$, is independent of W.

(iii) B and W are correlated and the regression of B on W is linear, say

\[ E(B \mid W) = C_0 + C_1 W. \]

The conditional variance of B given W, say $\sigma^2_{B|W}$, is independent of
W.

(iv) The expectation and the variance of the reciprocal of W exist. We
shall denote them by $1/W_1$ and $\sigma^2_{W^{-1}}$, respectively.
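According to the usual definition,

\[ R = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}. \]

Thus, in order to compute R, we have to compute the expectations of X,
Y, $X^2$, $Y^2$ and XY. Easy algebra gives

\[ E(X) = E\left(\frac{S}{W}\right) = E\left[\frac{E(S \mid W)}{W}\right]
        = E\left[\frac{A_0 + A_1 W}{W}\right] = \frac{A_0}{W_1} + A_1. \]

Similarly,

\[ E(X^2) = E\left[\frac{E(S^2 \mid W)}{W^2}\right]
          = E\left[\frac{\sigma^2_{S|W} + (A_0 + A_1 W)^2}{W^2}\right]
          = \left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)\left(\sigma^2_{S|W} + A_0^2\right)
            + \frac{2 A_0 A_1}{W_1} + A_1^2. \]

And it follows that

\[ \sigma_X^2 = \left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)\sigma^2_{S|W} + A_0^2 \sigma^2_{W^{-1}}. \]

Similarly,

\[ E(Y) = \frac{C_0}{W_1} + C_1, \]

\[ \sigma_Y^2 = \left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)\sigma^2_{B|W} + C_0^2 \sigma^2_{W^{-1}} \]

and

\[ E(XY) = A_0 C_0 \left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)
          + \frac{A_0 C_1 + A_1 C_0}{W_1} + A_1 C_1. \]

Upon substituting these results into the formula for the correlation
coefficient, we obtain the final result,

\[ R = \frac{A_0 C_0 \sigma^2_{W^{-1}}}
            {\sqrt{\left\{\left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)\sigma^2_{S|W} + A_0^2 \sigma^2_{W^{-1}}\right\}
                   \left\{\left(\sigma^2_{W^{-1}} + \frac{1}{W_1^2}\right)\sigma^2_{B|W} + C_0^2 \sigma^2_{W^{-1}}\right\}}}. \]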

It follows that the above intuitive considerations are only partly true.
In the conditions under which the formula for R was deduced, it is necessary
and sufficient for the lack of correlation between X and Y that either
$A_0 = 0$ or $C_0 = 0$ or both. If neither of these parameters is zero, then
the correlation R is positive whenever $A_0$ and $C_0$ have the same sign and
negative otherwise. You will remember that $A_0$ and $C_0$ are the intercepts
of the regression lines of S on W and of B on W, respectively.
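
The formula for R can be checked against a direct computation on a small artificial distribution. In the Python sketch below, W takes a few equally likely values, and S and B are obtained from their linear regressions on W by adding independent, equally likely deviations of fixed size, so that assumptions (i) to (iv) hold exactly. The expression used in `formula_corr` is the closed form as read here, with numerator A_0 C_0 times the variance of 1/W; the particular numbers and the function names are hypothetical.

```python
import math
from itertools import product


def exact_corr(A0, A1, C0, C1, w_values, dS, dB):
    """Correlation of X = S/W and Y = B/W by enumerating all equally likely outcomes."""
    outcomes = []
    for w, es, eb in product(w_values, (-dS, dS), (-dB, dB)):
        s = A0 + A1 * w + es  # linear regression of S on W plus a deviation
        b = C0 + C1 * w + eb  # linear regression of B on W plus a deviation
        outcomes.append((s / w, b / w))
    n = len(outcomes)
    ex = sum(x for x, _ in outcomes) / n
    ey = sum(y for _, y in outcomes) / n
    cov = sum((x - ex) * (y - ey) for x, y in outcomes) / n
    vx = sum((x - ex) ** 2 for x, _ in outcomes) / n
    vy = sum((y - ey) ** 2 for _, y in outcomes) / n
    return cov / math.sqrt(vx * vy)


def formula_corr(A0, C0, w_values, var_s, var_b):
    """Closed-form R in terms of A0, C0, the conditional variances and 1/W."""
    inv = [1.0 / w for w in w_values]
    mean_inv = sum(inv) / len(inv)                              # 1/W_1
    var_inv = sum((t - mean_inv) ** 2 for t in inv) / len(inv)  # variance of 1/W
    k = var_inv + mean_inv ** 2                                 # E(1/W^2)
    numerator = A0 * C0 * var_inv
    denominator = math.sqrt((k * var_s + A0 ** 2 * var_inv)
                            * (k * var_b + C0 ** 2 * var_inv))
    return numerator / denominator


w_values = (1.0, 2.0, 4.0)
dS, dB = 1.0, 2.0  # deviations of +/- dS and +/- dB give variances dS^2 and dB^2

r_exact = exact_corr(3.0, 2.0, 5.0, 1.0, w_values, dS, dB)
r_formula = formula_corr(3.0, 5.0, w_values, dS ** 2, dB ** 2)
r_zero = exact_corr(0.0, 2.0, 5.0, 1.0, w_values, dS, dB)       # A0 = 0
r_negative = exact_corr(-3.0, 2.0, 5.0, 1.0, w_values, dS, dB)  # opposite signs

print(r_exact, r_formula, r_zero, r_negative)
```

In this sketch the enumeration and the formula agree to rounding error, the correlation vanishes when one intercept is zero, and it changes sign when the two intercepts have opposite signs.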

The above analysis of correlation was made under very simplifying 
assumptions. However, you will have no difficulty in performing it in 
the more general case when the regressions of S on W and of B on W are 
represented by any polynomials. 

The theory of spurious correlation is not a very difficult matter and, as 
I have already mentioned, the phenomenon has been known for quite some 
time. With all due respect to Karl Pearson, I am inclined to alter slightly 
the label he invented. There is nothing spurious in the correlation between 
X and Y. When $A_0 \neq 0$ and $C_0 \neq 0$, the correlation between these two
variables is quite real. Therefore the term “spurious correlation” seems 
to miss the point. The real point of the discussion is that the computation 
of the quotients X and Y is undertaken in order to study the correlation, 
not between these variables themselves, but between the social, economic 
or biological factors that these quotients are supposed to represent. It is 
the method of study that is faulty and, if the adjective “spurious” is to 
be used at all, it should be applied to the method of studying correlation 
between factors of primary interest; in the present case, between the num- 
ber of babies born on the one hand and the number of storks on the other. 
Only these two factors are of interest. It is suspected that each may be 
correlated with the third factor, the number of women. Therefore, the 
appropriate method of study is to compute the partial correlation between 
B and S with the influence of W eliminated. In proceeding in this fashion, 
there may be specific difficulties due to the lack of linearity, etc. However, 
these difficulties can hardly be decreased by using a spurious method. 
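
For the data of Table I the recommended computation is easily carried out. The Python sketch below regenerates the 54 counties from the pattern of the table and compares three quantities: the raw correlation between S and B, the correlation between the quotients X and Y, and the partial correlation between S and B with W eliminated. The partial correlation is computed by the usual first-order formula from the three pairwise coefficients, which is a choice made for this sketch since the text does not write the formula out; the helper names are likewise illustrative.

```python
import math

# Table I rebuilt from its pattern (W in 10,000s of women).
STORKS = {1: (2, 3, 4), 2: (4, 5, 6), 3: (5, 6, 7),
          4: (6, 7, 8), 5: (7, 8, 9), 6: (8, 9, 10)}
counties = [(w, s, b)
            for w, storks in STORKS.items()
            for s in storks
            for b in (5 * w + 5, 5 * w + 10, 5 * w + 15)]

W = [float(w) for w, _, _ in counties]
S = [float(s) for _, s, _ in counties]
B = [float(b) for _, _, b in counties]


def pearson(u, v):
    """Ordinary product-moment correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def partial_corr(r_uv, r_uw, r_vw):
    """First-order partial correlation of u and v with w eliminated."""
    return (r_uv - r_uw * r_vw) / math.sqrt((1 - r_uw ** 2) * (1 - r_vw ** 2))


r_sb = pearson(S, B)                        # raw totals
r_xy = pearson([s / w for s, w in zip(S, W)],
               [b / w for b, w in zip(B, W)])  # the two quotients
r_sb_given_w = partial_corr(r_sb, pearson(S, W), pearson(B, W))

print(f"r(S, B)     = {r_sb:+.3f}")
print(f"r(X, Y)     = {r_xy:+.3f}")
print(f"r(S, B | W) = {r_sb_given_w:+.3f}")
```

Both the totals and the quotients are substantially correlated, while the partial correlation is zero up to rounding error, in agreement with the inspection of Table I.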

In spite of the fact that the phenomenon of spurious correlation has 
been known for half a century, many a practical statistician, as well as the 
general public, is misled by it from time to time. 

Thus we see “proofs” that the density of bars increases the frequency 
of crimes. This fact is likely to be true but the argument brought in sup- 
port of the assertion is faulty. In this and similar cases, there is no special 
harm done to Society. But there are other cases. Not so very long ago, 
I saw a detailed analysis of various problems of farm management. A 
considerable amount of money and effort was expended to collect the data. 
One of the conclusions reached was that, while the primary factor governing 
the employment of manual labor is the size of the farm, the density of 
employment increases with the increase of the proportion of the farm land 
which is arable. Again, this assertion may be true but the argument is
faulty and the final tables presented in support of the assertion, quite
analogous to Table II, are entirely irrelevant.

There are even more regrettable cases on record in which the spurious
method of studying correlation and regression was used and, moreover, 
was left undetected. Some years ago a scholar was interested in the 
question of whether or not railway rates were sufficient to cover expenses. 
His conclusion was that passenger traffic, on the average, barely paid its 
way and had fallen appreciably short at times while freight traffic, on the 
average, paid the net of the railway operation. For his analysis he used 
the data of Class I railroads as given in the annual volumes of Statistics 
of Railways in the United States, published by the Interstate Commerce 
Commission. There are somewhat less than 200 roads in Class I (185 were 
itemized in the 1923 volume). He decided to correlate the total cost of 
operation with the passenger and freight traffic. However, Class I railways 
are very different in length, ranging from 21 miles to over 10,000, and this 
variation must create correlations between the three variables considered 
which are irrelevant to the main problem. In order to eliminate the dis- 
turbing factor, the author used the data given under the heading “averages 
per mile of road” rather than the totals for each road which are also given. 
He then correlated these “averages per mile of road” which are the totals 
of each variable for a given railroad divided by the length of the road. 
The partial regression coefficients thus computed are expected to measure 
the average additional cost to the railroads which accompanies a unit 
increase in the particular service. If a partial regression coefficient is less 
than the corresponding rate, then the railroads as a whole are adequately 
paid for their services and make a profit. Otherwise they lose money or, 
at best, break even. Figures taken from the article are reproduced in 
Table III. 


TABLE III

E = expenses per mile of railroad in $1
F = number of “1000 ton-miles” of freight traffic per mile of railroad
P = number of “1000 passenger-miles” of passenger traffic per mile of railroad

              Averages               Multiple regression equation,
Year      E        F        P        expenses being regarded as dependent
1919    1865    155.8    20.00       E = 6.8F + 29P + 221.3
1921    1936    131.7    16.11       E = 8.0F + 33P + 353.0
1922    1858    142.7    15.10       E = 7.9F + 30P + 279.2
1923    2022    174.3    15.42       E = 7.3F + 30P + 285.4


TABLE IV

L = length of railroad in miles
Z = total expenses in $10,000
X = total freight traffic in 100,000 ton-miles
Y = total passenger traffic in 100,000 passenger miles

   L     Z      X     Y       L     Z      X     Y       L     Z      X     Y
 100    61    535    69     100    71    535    69     100    81    535    69
 100    61    535    72     100    71    535    72     100    81    535    72
 100    61    535    75     100    71    535    75     100    81    535    75
 100    61    550    69     100    71    550    69     100    81    550    69
 100    61    550    72     100    71    550    72     100    81    550    72
 100    61    550    75     100    71    550    75     100    81    550    75
 100    61    565    69     100    71    565    69     100    81    565    69
 100    61    565    72     100    71    565    72     100    81    565    72
 100    61    565    75     100    71    565    75     100    81    565    75
 500    90    615    71     500   100    615    71     500   110    615    71
 500    90    615    75     500   100    615    75     500   110    615    75
 500    90    615    79     500   100    615    79     500   110    615    79
 500    90    650    71     500   100    650    71     500   110    650    71
 500    90    650    75     500   100    650    75     500   110    650    75
 500    90    650    79     500   100    650    79     500   110    650    79
 500    90    685    71     500   100    685    71     500   110    685    71
 500    90    685    75     500   100    685    75     500   110    685    75
 500    90    685    79     500   100    685    79     500   110    685    79
 ...
3000   180   1200   110    3000   190   1200   110    3000   200   1200   110


Miss Evelyn Fix was kind enough to prepare Table IV indicating what
might have been the raw data regarding the expenditures of the railroads.
This table is analogous to Table I. The partial regression coefficients of
the same kind as appear in Table III are given in Table V. It will be seen
that the conclusions they suggest are similar to those suggested by Table
III and entirely contrary to those drawn from the original data of Table IV.
In fact, upon inspecting this table it will be seen that, for each fixed size
of railroad, the hypothetical expenditures Z are entirely independent of
both the total freight traffic X and of the total passenger traffic Y.


TABLE V

z = Z/L = expenses per mile of railroad in $1
x = X/L = number of “1000 ton-miles” of freight traffic per mile of railroad
y = Y/L = number of “1000 passenger-miles” of passenger traffic per mile of railroad

Average for all roads (189):  $\bar{z}$ = 1943.3
                              $\bar{x}$ = 132.14
                              $\bar{y}$ = 16.048

Multiple regression equation, expenses being regarded as dependent:

z = 7.994x + 32.87y + 359.7

The article to which I am referring met with opposition from several 
authors. However, it is curious that none of the discussants thought that 
the method of constructing Table III was spurious. 

In broad circles of the general public, the opinion still prevails that, in 
order to conduct statistical studies, one must have enough funds, a few 
electric calculators and some common sense. Funds and electric calculators 
are very useful and common sense is just grand. It appears, however, that 
a little professional education is now and then also useful.


CHAPTER IV
Statistical Estimation 


Part 1. Practical Problems and Various Attempts to Formulate 
Their Mathematical Equivalents 


(Based on a conference held in the auditorium of the Department of Agriculture, April 
8, 1937, 10 a.m., Mr. Alexander Sturges presiding.) 


In this conference I shall try to explain, from the modern point of view, 
the practical origin of the statistical problem of estimation and some of 
the early attempts at its solution. The material of the conference falls 
under four headings. First, under the subtitle “Applicational Roots of the 
Problem of Estimation,” there will be two examples of practical problems. 
This subsection is followed by two subsections under contrasting subtitles, 
one on “The Classical Bayes’ Approach” and the other on “The Modernized 
Bayes’ Approach.” The last subsection is given to the somewhat contro- 
versial methods advanced to circumvent the difficulties caused by the 
absence of exact information regarding the a priori distributions of the
estimated parameters. 


APPLICATIONAL ROOTS OF THE PROBLEM OF ESTIMATION 


Practical problems of statistical estimation may be illustrated by the 
following examples. 

Example 1.—We are interested in a certain characteristic $\xi$ of the totality
of farms in the United States. This characteristic could be evaluated
exactly if we had the necessary data regarding each and every farm. However,
the time needed for a one hundred percent survey of farms, and also
the cost, would be prohibitive. The best that we can do is this: select a
sample of farms for which we will obtain all the pertinent information.
Then, the statistical problem of estimation consists in using the data of
the sample to evaluate the approximate value of $\xi$.

Example 2.—As a result of a certain illness, the blood of a patient contains
a toxic substance A. The effect of the substance A can be neutralized
by giving the patient an injection of a specified chemical B. The treatment
will be effective if the dose of B is appropriately adjusted to the average
content, say $\eta$, of substance A per unit volume of blood of the patient.
The exact value of $\eta$ could be determined by draining all the blood from
the patient and then performing a large scale quantitative analysis of the
blood. Since this is impractical, the doctor has to adjust the dosage of
his injection, not to the exact value of $\eta$, but to the results of analyses of
two or three small samples of the blood of the patient. Let $x_1, x_2, \ldots, x_n$
stand for the determinations of A in n samples of blood. The problem of
estimation facing the doctor is to use these numbers $x_1, x_2, \ldots, x_n$ in order
to obtain a value which, presumably, does not differ very much from $\eta$.

Stated in this form, the two problems of estimation just described are
not mathematical problems and, therefore, cannot be given a mathematical
solution. In fact, it is doubtful whether or not any sort of solution can be
offered. Both $\xi$ and $\eta$ have a strictly defined meaning and can be computed
exactly. However, in the practical situation, the data necessary for the
evaluation of $\xi$ and $\eta$ are missing.

In order to arrive at an acceptable solution of the problem of estimation 
based on calculus of probability, we must begin by translating the problem 
into the language of probability and by requiring that the method of select- 
ing the sample of farms and the method of determination of the substance A 
in the samples of blood satisfy certain conditions. 

The theory of probability deals with the general question of how frequently
this or that event will occur in random experiments of a specified
nature. Thus, in order to apply the theory of probability to any domain,
this domain must involve some elements of randomness and we must have
some information about the nature of the randomness. Thus, if, in the
case of the problem regarding the totality of farms in the United States,
we are given detailed data for, say, 10,000 farms without any information
concerning the method of selection, there is no way in which the theory of
probability can be used to estimate $\xi$. The situation is different if we are
told that a sample of 10,000 farms has been drawn at random from the
total population of farms in a specified manner. For example, it may be
specified that the manner of selecting the sample was such that every possible
combination of 10,000 farms had the same chance of being selected.
By this, we mean that, if the sampling procedure is repeated many times,
then each and every combination of 10,000 farms will be selected approximately
with the same frequency. By referring to Part 1, Chapter III, on
“Sampling Human Populations,” the reader will see that the scheme of
sampling just described has been labeled “unrestrictedly random.” This is
not the only scheme possible and random sampling of farms may be combined
with stratification, etc. The essential point is that, in order to apply
the theory of probability, the statement of the problem must involve randomness
in one form or another. Similarly, in order to use geometry to
solve a given practical problem, the conditions of the problem must be
stated in geometrical terms. For example, the question “What is the area
of a red triangle?” cannot be solved by plane geometry because of the lack
of necessary data expressed in geometrical terms.
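
The frequency interpretation of unrestricted random sampling can be illustrated on a toy population. The Python sketch below draws samples of 2 from a hypothetical population of 5 farms many times over and records how often each of the 10 possible combinations appears. The population, the sample size, the number of repetitions and the seed are all arbitrary choices made for illustration.

```python
import random
from collections import Counter
from itertools import combinations

rng = random.Random(0)              # fixed seed so that the run is repeatable
farms = ["a", "b", "c", "d", "e"]   # hypothetical population of five farms
sample_size = 2
repetitions = 20000

# Repeat the sampling procedure many times and count each combination drawn.
counts = Counter()
for _ in range(repetitions):
    sample = tuple(sorted(rng.sample(farms, sample_size)))
    counts[sample] += 1

all_combinations = list(combinations(farms, sample_size))
frequencies = {c: counts[c] / repetitions for c in all_combinations}

for combo, freq in frequencies.items():
    print(combo, f"{freq:.3f}")
```

Each of the ten combinations turns up with a relative frequency close to 1/10, which is what is meant by saying that every combination has the same chance of being selected.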

Once the random method of sampling farms is specified, the characteristics,
say $X_1, X_2, \ldots, X_n$, of the n farms which will appear in the sample
become random variables. Ordinarily, the probability distribution of these
random variables will depend on the value of $\xi$ and the application of the
theory of probability to the problem of estimation becomes possible.

The situation with the second example is quite similar. As long as 
nothing is known about the determinations of the content of A except that 
in a given case these determinations gave, say, 3.5 percent, 4.3 percent 
and 5.1 percent, the theory of probability is helpless to provide anything 
about the average content of A in the blood of the patient. However, 
repeated studies of the method of obtaining determinations of A, similar 
to the work of Matuszewski and Supinska, described in Part 2, Chapter I, 
may reveal the following. 

If the same method is applied many times, then the individual
determinations group themselves about the true average content $\eta$ in a
manner characterized by the normal law of frequency. In other words, previous
empirical studies may indicate that the relative frequency of determinations
falling within any specified interval $(a, b)$ differs but little from the
integral

$$\frac{1}{\sigma\sqrt{2\pi}} \int_a^b e^{-(x-\eta)^2/2\sigma^2}\,dx$$

where $\sigma$ may vary from one patient to another. If so, then the future
determinations of A contemplated for a given patient may be considered
as random variables following the normal law with unknown mean $\eta$ and
unknown variance $\sigma^2$.
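The comparison between the integral of the normal law and an observed relative frequency can be tried numerically. The following is only a sketch: the mean, the standard deviation and the limits of the interval are hypothetical values chosen for illustration, and the function names are not part of the text.

```python
import math
import random

def normal_interval_probability(a, b, eta, sigma):
    """Integral of the normal density (mean eta, s.d. sigma) over (a, b)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - eta) / (sigma * math.sqrt(2.0))))
    return cdf(b) - cdf(a)

def simulated_relative_frequency(a, b, eta, sigma, trials=200_000, seed=1):
    """Relative frequency of simulated determinations falling within (a, b)."""
    rng = random.Random(seed)
    hits = sum(a < rng.gauss(eta, sigma) < b for _ in range(trials))
    return hits / trials

# Hypothetical values, chosen only for illustration.
eta, sigma = 4.3, 0.8
a, b = 3.5, 5.1
exact = normal_interval_probability(a, b, eta, sigma)
observed = simulated_relative_frequency(a, b, eta, sigma)
print(f"integral = {exact:.4f}, simulated frequency = {observed:.4f}")
```

With many simulated determinations the two numbers should differ but little, which is the sense in which the normal law is said to describe the method of determination.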

Generalizing these remarks relating to two particular examples, we may 
say that the statistical problem of estimation, to be solvable by means of 
the theory of probability, must involve the following elements. 

(a) There must be one or more random variables, say $X_1, X_2, \ldots, X_n$,
particular values of which will be given by future observations. These
variables will be described as the observable random variables and, for
the sake of brevity, their set will be denoted by a single letter $E$ (the
event point).

(b) The probability distribution of the observable random variables must
be known to belong to a specified family, say $F$. Ordinarily, the particular
distributions belonging to $F$ are represented by the same formula involving
one or more parameters, say $\theta_1, \theta_2, \ldots, \theta_s$, each
capable of assuming a certain set of values. Thus, in order to specify
completely any one of the distributions belonging to $F$, it is sufficient
to specify the values of the parameters $\theta_1, \theta_2, \ldots, \theta_s$.

The general problem of statistical estimation consists in devising a 
method of making assertions regarding the value of one (or more) parameter 
out of the set $\theta_1, \theta_2, \ldots, \theta_s$, in relation to the
particular values of the random variables $X_1, X_2, \ldots, X_n$ which will
be furnished by observation.

Example 2 provides a simple illustration of the general situation. Let n 
be the number of contemplated independent determinations of the content 
of substance A in samples of blood of a patient. Then the totality $E$ of
future observations is represented by $n$ independent random variables,
$X_1, X_2, \ldots, X_n$. The empirically supported postulate that each such
determination is a normal variable with expectation $\eta$ and variance
$\sigma^2$ amounts to postulating that the joint probability density function
of $E$ is represented by the formula

$$p_E(x_1, x_2, \ldots, x_n) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}(x_i-\eta)^2/2\sigma^2} \tag{1}$$

where $\eta$ and $\sigma$ are two parameters with unspecified values and
$x_1, x_2, \ldots, x_n$ denote possible values of the random variables
$X_1, X_2, \ldots, X_n$. Thus, we may say that, in this particular case, the
actual distribution of $E$ is known to belong to the family $F$ of
distributions, each characterized by the probability density of the same
form (1), with only two parameters, $\eta$ and $\sigma$.

Due to the particular nature of the problem in which $\eta$ represents the
average content of substance A in the blood of the patient, it is possible to
assert that $\eta$ cannot be negative and cannot exceed one. Furthermore,
there may be biological reasons insuring that $\eta$ must lie between even
narrower limits. Also, the same kind of argument will be applicable to
$\sigma$, with the result that it may be taken for granted that its value
cannot exceed some specified limits. If we grant the approximation by the
normal law, formula (1) and the limits for $\eta$ and $\sigma$ summarize our
postulated knowledge of the observable random variables
$X_1, X_2, \ldots, X_n$. Our interest in the actual value of $\eta$ leads to
the search for a method of using the observed values of
$X_1, X_2, \ldots, X_n$ which will be furnished by the chemical analyses to
make assertions regarding $\eta$.

The words used, to the effect that we search “for a method of making 
assertions regarding $\eta$,” do not describe the situation completely. We do
not search for just any method of making assertions, but for a method that
is, from some convincing point of view, a satisfactory method. Even more, 
we are likely to prefer the method that is the best of all possible methods. 

While there is likely to be general agreement as to the desirability of 
using the best, or at least a satisfactory, method of making assertions 
regarding $\eta$, there may be difficulty in explaining exactly what properties
a method of estimation should possess in order to qualify as the “best” or
as “satisfactory.” And without having such an exact explanation, without 
knowing exactly what we are looking for, it is obviously hopeless to expect 
that we shall ever find it. If it were possible to devise a method of using 
the values of the observable random variables to predict exactly and without 
fail the value of the estimated parameter, then there would be universal 
agreement that the method in question is the best imaginable. However, 
it is obvious that, barring some very artificial examples, such a method 
does not exist and we have to put up with unavoidable errors. For example, 
whatever the method of using the determinations of the toxic substance A 
in a few samples of blood, it is obviously impossible to expect that the 
outcome of estimation will always give the true value of $\eta$. On the
contrary, we may take it for granted that the estimate obtained will always
differ from the exact value of $\eta$. Similarly, whatever the method of
estimating the characteristic $\xi$ of the totality of farms in the United States
by the use of a sample, smaller or larger errors of estimation are unavoidable. 

This being the case, what should be our definition of a “satisfactory” 
method of estimation? What should be the definition of the “best” 
method? 

Before attempting to answer these questions, let us consider the possible 
forms of the assertions regarding the estimated parameter which can be 
made using the values of the observable random variables. The simplest
form is the so-called “point estimate.” The method of point estimation
of a parameter $\theta$ consists of defining a single-valued function, say
$\theta^*(E) = \theta^*(X_1, X_2, \ldots, X_n)$, of the observable random
variables and, whenever the observations give $X_1 = x_1$, $X_2 = x_2$,
$\ldots$, $X_n = x_n$, of making a rule of asserting that
$\theta = \theta^*(x_1, x_2, \ldots, x_n)$. The function $\theta^*(E)$ is
called the point estimate or the single estimate of $\theta$.

As already mentioned, in many cases it is more or less hopeless to expect 
that a point estimate will ever be equal to the true value of $\theta$. In cases of
this kind one is naturally interested in the precision of the estimate used. 
This precision may be usefully characterized by indicating the limits which 
the error in the estimate, presumably, could not exceed. As a consequence 
of this tendency, the results of practical investigations are frequently
published in the form $\theta^* \pm S$, e.g. $10 \pm 1.3$, or the like. This
form of giving the results of statistical estimation suggests that, while the
presumed value of the estimated parameter is 10, there is expected an error
of estimation which, however, should not exceed 1.3 either way. It will be
noticed that, in effect, this method of estimation amounts to computing from
the results of observation not one but two different functions,
$\theta^* - S$ and $\theta^* + S$, and asserting that the true value of the
parameter $\theta$ lies somewhere within the limits from $\theta^* - S$ to
$\theta^* + S$.

This procedure is obviously different from that of point estimation. It is
described as the estimation by interval. In general terms, the estimation
by interval consists of defining not one, but two functions of the observable
random variables, say $\underline{\theta}(E)$ and $\bar{\theta}(E)$, and of
making a rule of asserting that the true value of $\theta$ lies between the
limits $\underline{\theta}(x_1, x_2, \ldots, x_n)$ and
$\bar{\theta}(x_1, x_2, \ldots, x_n)$, whenever the observations give
$X_1 = x_1$, $X_2 = x_2$, $\ldots$, $X_n = x_n$. The functions
$\underline{\theta}(E)$ and $\bar{\theta}(E)$ used in this manner are described
as the lower and the upper estimate, respectively. Also, on occasion, it will
be convenient to speak of “theta lower” and “theta upper.”

The above two forms of estimation, by single estimate and by interval, 
are not the only possible methods. In fact, both are particular cases of a 
more general procedure of estimation by a set. The latter, while theoretically
possible, does not seem to have much practical interest. In fact, if it is
suggested to use a method of statistical estimation which may lead to the
assertion, for example, that the mass of a particle is a number of grams
commensurable with $\sqrt{2}$ and contained between zero and one, then we may
confidently expect that the physicists concerned will show some signs of 
indignation. For this reason we shall limit our considerations to estimation by 
a single estimate and by an interval. 

With reference to single estimates, roughly one can say that a point estimate, 
to be satisfactory, should not differ from the estimated quantity “too
frequently too much.” While intuitive, this statement is obviously too vague
to serve as the basis for a theory of estimation. One way of specifying the 
problem exactly is to reduce it to the problem of estimation by interval. 
In fact, if this problem is solved satisfactorily, then the point estimate repre- 
sented by some specified interior point, e.g., by the midpoint of the estimating 
interval, would probably seem acceptable. 

If now we turn our attention to the problem of estimation by interval, we 
find that this problem is easier to put into exact terms in a manner likely to 
satisfy the practical statistician. In fact, there is one obvious requirement 
which any ‘satisfactory’ method of estimation by interval should meet. 
This is that if it is impossible to arrange that the results of estimation are 
correct always, we may at least expect them to be correct frequently. In more 
precise terms, it is natural to require that the estimating interval cover the 
true value of the estimated parameter with a high relative frequency and that 
it be possible to fix this frequency in advance. In probabilistic terms this 
postulate is expressed as follows: (i) when estimating an unknown parameter
$\theta$, the satisfactory lower and upper estimates of $\theta$ must have the
property that the probability
$P\{\underline{\theta}(E) \le \theta \le \bar{\theta}(E)\}$ be computable and
close to unity. If this probability has a specified large value $\alpha$, say
$\alpha = .99$, then the practical statistician using the estimates
$\underline{\theta}(E)$, $\bar{\theta}(E)$ will have the assurance that, in
the long run, his assertions regarding the estimated parameters in the form
$\underline{\theta}(E) \le \theta \le \bar{\theta}(E)$ will be correct about
99 percent of the time, and this is likely to satisfy him. In fact, in this
case, each of his assertions regarding the value of $\theta$ will be exactly
comparable to playing a game of chance with the probability of winning equal
to $\alpha = .99$. Naturally, the actual realization of this high frequency
of correct assertion regarding $\theta$ depends on how closely
the postulated properties of the observable random variables agree with the 
actual conditions of experimentation. Thus, if we postulate that the deter- 
minations of the toxic substance A are normally distributed, while in actual 
fact these determinations have, say, a skew U-shaped distribution, then the 
interval estimation of $\eta$ based on the assumption of normality need not give
the expected frequency of correct results. However, the basic agreement 
between the postulates of the theory and the phenomena studied is omni- 
present in all problems of application and, in this particular respect, the prob- 
lems of estimation do not present any sort of exception. 
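Requirement (i), and its dependence on the postulated distribution, can be illustrated by a small simulation. The sketch below rests on assumptions that are not in the text: the standard deviation is treated as known, the interval is the familiar one centered at the sample mean, and all numerical values and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def interval_estimate(xs, sigma, alpha):
    """Lower and upper estimates of the mean, s.d. sigma assumed known."""
    half = NormalDist().inv_cdf(0.5 + alpha / 2.0) * sigma / math.sqrt(len(xs))
    mean = sum(xs) / len(xs)
    return mean - half, mean + half

def success_frequency(draw, true_mean, sigma, n, alpha, trials=40_000, seed=7):
    """Relative frequency with which the interval covers true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [draw(rng) for _ in range(n)]
        lower, upper = interval_estimate(xs, sigma, alpha)
        hits += lower <= true_mean <= upper
    return hits / trials

# Hypothetical values: true mean 4.0, standard deviation 1.0, n = 3.
eta, sigma, n, alpha = 4.0, 1.0, 3, 0.99
freq_normal = success_frequency(lambda r: r.gauss(eta, sigma), eta, sigma, n, alpha)
# A skew distribution with the same mean and standard deviation.
freq_skew = success_frequency(lambda r: eta - 1.0 + r.expovariate(1.0),
                              eta, sigma, n, alpha)
print(f"normal: {freq_normal:.4f}   skew: {freq_skew:.4f}")
```

When the determinations are in fact normal, the frequency of correct assertions should be close to the chosen $\alpha$; when they follow the skew law, the same rule need not attain it.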

Suppose for a moment that the problem of determining the lower and the 
upper estimates satisfying requirement (i) is solved and that there is more 
than one solution. Suppose, for example, that two pairs of functions
$\underline{\theta}(E)$, $\bar{\theta}(E)$ and $\underline{\vartheta}(E)$,
$\bar{\vartheta}(E)$ both satisfy the condition that

$$P\{\underline{\theta}(E) \le \theta \le \bar{\theta}(E)\} = P\{\underline{\vartheta}(E) \le \theta \le \bar{\vartheta}(E)\} = \alpha.$$

Thus, whether the practical statistician uses $\underline{\theta}(E)$ and
$\bar{\theta}(E)$ or $\underline{\vartheta}(E)$ and $\bar{\vartheta}(E)$,
his assertions regarding the value of the estimated parameter will be correct
with exactly the same long run relative frequency $\alpha$, chosen by himself.
In these circumstances, the statistician will be faced with the problem, which
we shall denote as problem (ii), of choosing between the estimates
$\underline{\theta}(E)$ and $\bar{\theta}(E)$ on the one hand and the
estimates $\underline{\vartheta}(E)$ and $\bar{\vartheta}(E)$ on the other.
Naturally he will consider the question, which of the two pairs of functions
will provide a more exact estimate. If possible, the practical statistician
will select for his use the particular pair of functions for which the length
of the estimating interval is the least. Should it be impossible to satisfy
this condition uniformly so that the selected pair, say
$\underline{\theta}(E)$, $\bar{\theta}(E)$, always gives narrower limits for
$\theta$ than any alternative pair, $\underline{\vartheta}(E)$,
$\bar{\vartheta}(E)$, that is,

$$0 \le \bar{\theta}(E) - \underline{\theta}(E) \le \bar{\vartheta}(E) - \underline{\vartheta}(E),$$


then the practical statistician is likely to formulate some sort of second 
best requirement substituting “most frequently” for “always” or some such. 
The essential point in this discussion is that, after finding several estimating 
intervals capable of covering the true value of the estimated parameter 
with the same relative frequency, the choice between these estimating 
intervals will be based on considerations of their length. 
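Problem (ii) can be made concrete in a simple case. The sketch below assumes independent normal determinations with a known standard deviation and compares two pairs of estimates that cover the true mean with the same probability .99 but divide the admissible error unequally between the two sides. The function `limits` and the numerical values are hypothetical.

```python
import math
from statistics import NormalDist

def limits(mean, sigma, n, tail_1, tail_2):
    """Lower and upper estimates of the true mean; the interval fails to
    cover it with total probability tail_1 + tail_2."""
    std = NormalDist()
    se = sigma / math.sqrt(n)
    lower = mean - std.inv_cdf(1.0 - tail_2) * se
    upper = mean - std.inv_cdf(tail_1) * se
    return lower, upper

mean, sigma, n = 4.3, 0.8, 3   # hypothetical observed mean and known s.d.
pair_a = limits(mean, sigma, n, 0.005, 0.005)   # error divided equally
pair_b = limits(mean, sigma, n, 0.009, 0.001)   # error divided unequally
length_a = pair_a[1] - pair_a[0]
length_b = pair_b[1] - pair_b[0]
print(f"equal division:   {pair_a}, length {length_a:.4f}")
print(f"unequal division: {pair_b}, length {length_b:.4f}")
```

In this particular setting the lengths do not depend on the observations, so the first pair gives the narrower limits in every case and the choice between the two presents no difficulty.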

You will realize that, compared with the statement of the practical 
problem of statistical estimation as illustrated in the two examples
discussed at the outset, we have gone a long way toward transforming the
practical problem into a mathematical problem. However, even now the 
problem has not been made entirely precise. 

The complete specification of the problem of estimation depends on the 
assumed conditions. As we have already emphasized, these conditions 
must imply that the observable random variables $X_1, X_2, \ldots, X_n$ are
random variables with a distribution connected in some way with the quantity
that one desires to estimate. For example, the observable random variables
may be known to be of continuous type and it may be known that their
probability density function, say

$$p_{X|\theta}(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_s),$$

has a known form and depends on some $s$ parameters
$\theta_1, \theta_2, \ldots, \theta_s$ the values of which are uncertain. Our
problem may be to estimate one (or more) of them, say $\theta_1$. Generally,
problems of estimation vary in the amount of
knowledge of the distribution of the observable random variables and the 
connection between the quantity to be estimated and the distribution need not 
be so simple. However, the assumptions just made are sufficiently illustra- 
tive and we shall adhere to them. 

In addition to the data regarding the distribution of the observable random 
variables, the problems of estimation vary in respect to a very important 
factor which is our assumed knowledge regarding the quantity to be estimated, 
$\theta_1$, and also regarding such other unknown parameters
$\theta_2, \theta_3, \ldots, \theta_s$ as may appear in the probability density
$p_{X|\theta}(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_s)$.
In many practical problems, a certain amount of knowledge regarding these
parameters is always available. For instance, the conditions of the above
example 2 imply that the quantity $\eta$ is a non-negative number not
exceeding unity. Also, there may be some additional items of information
which affect the form of the problem of estimation. The most radical
difference in this form depends on whether or not the parameters
$\theta_1, \theta_2, \ldots, \theta_s$ are themselves random variables,
the distribution of which is known sufficiently to be used in calculations. In
relation to any given problem, this paramount question has to be answered 
by the practical statistician treating it. Our purpose here will be to describe 
the nature of the problem of estimation under both sets of conditions, when 
the unknown parameters are random variables with a known distribution 
and when they are not. 


THE CLASSICAL BAYES’ APPROACH 


Historically, the first precise treatment of the problem of estimation refers 
to the case when all the unknown parameters are random variables with a 
postulated distribution. Therefore, we shall begin our exposition with this
particular case. The classical statement and solution of the problem are
based on the famous formula of Bayes.¹ After presenting them, we shall
outline a more modern approach to the problem treated under the same 
conditions. 

Consider, then, a set of observable random variables, $X_1, X_2, \ldots, X_n$,
with a probability density function
$p_{X|\theta}(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_s)$
which depends on $s$ unknown parameters,
$\theta_1, \theta_2, \ldots, \theta_s$. Assume that the conditions of the
problem imply that these parameters are also random variables with a
probability density function
$\Psi_\theta(\vartheta_1, \vartheta_2, \ldots, \vartheta_s)$. Ordinarily, the
distribution determined by this function is called the a priori distribution
of the parameters $\theta_1, \theta_2, \ldots, \theta_s$ and is contrasted
with the a posteriori distribution, obtainable from Bayes’ formula, say

$$\Phi(\vartheta_1, \ldots, \vartheta_s \mid x_1, \ldots, x_n) = \frac{\Psi_\theta(\vartheta_1, \ldots, \vartheta_s)\, p_{X|\theta}(x_1, \ldots, x_n \mid \vartheta_1, \ldots, \vartheta_s)}{\int \cdots \int \Psi_\theta(\vartheta_1, \ldots, \vartheta_s)\, p_{X|\theta}(x_1, \ldots, x_n \mid \vartheta_1, \ldots, \vartheta_s)\, d\vartheta_1\, d\vartheta_2 \cdots d\vartheta_s} \tag{2}$$

Here the integration in the denominator extends over all systems of values
of the $\theta$’s which are compatible with the values
$x_1, x_2, \ldots, x_n$ of the observable random variables. Integrating
$\Phi$ for $\vartheta_2, \vartheta_3, \ldots, \vartheta_s$ over all systems of
values compatible with the fixed value $\vartheta_1$ of $\theta_1$, we obtain
the a posteriori probability density function of $\theta_1$ given the values
$x_1, x_2, \ldots, x_n$ of the observable random variables, say

$$\varphi(\vartheta_1 \mid x_1, x_2, \ldots, x_n) = \int \cdots \int \Phi \, d\vartheta_2\, d\vartheta_3 \cdots d\vartheta_s. \tag{3}$$

The product in the numerator of formula (2) represents the joint probability
density function of all the parameters $\theta_1, \theta_2, \ldots, \theta_s$
and of all the observable random variables $X_1, X_2, \ldots, X_n$.
Integrating it for $\vartheta_1, \vartheta_2, \ldots, \vartheta_s$ for all
combinations of their values compatible with the fixed
$x_1, x_2, \ldots, x_n$, we obtain the absolute probability density of
$X_1, X_2, \ldots, X_n$, say

$$p_X(x_1, x_2, \ldots, x_n) = \int \cdots \int \Psi_\theta(\vartheta_1, \ldots, \vartheta_s)\, p_{X|\theta}(x_1, \ldots, x_n \mid \vartheta_1, \ldots, \vartheta_s)\, d\vartheta_1\, d\vartheta_2 \cdots d\vartheta_s.$$

This expression appears in the denominator of formula (2).
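As a numerical illustration of formula (2) in the simplest case of a single parameter ($s = 1$), the a posteriori density can be tabulated on a grid. The sketch below is built on assumptions not made in the text: three hypothetical determinations, a known standard deviation, and an a priori density that is uniform between zero and one. The integral in the denominator is replaced by a sum over the grid.

```python
import math

def likelihood(xs, theta, sigma):
    """Joint normal density of independent observations, given the mean theta."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return math.prod(c * math.exp(-(x - theta) ** 2 / (2.0 * sigma ** 2)) for x in xs)

def posterior_on_grid(xs, prior, sigma, grid, step):
    """A posteriori density on the grid, following formula (2) with s = 1."""
    numerator = [prior(t) * likelihood(xs, t, sigma) for t in grid]
    absolute_density = sum(numerator) * step   # sum replacing the integral
    return [v / absolute_density for v in numerator], absolute_density

# Hypothetical determinations (as fractions) and known standard deviation.
xs = [0.035, 0.043, 0.051]
sigma = 0.01
cells = 10_000
step = 1.0 / cells
grid = [(i + 0.5) * step for i in range(cells)]    # midpoints covering (0, 1)
posterior, p_x = posterior_on_grid(xs, lambda t: 1.0, sigma, grid, step)
mode = grid[max(range(cells), key=lambda i: posterior[i])]
print(f"most probable value on the grid: {mode:.5f}")
```

Under these assumptions the tabulated density integrates to unity over the grid and its maximum falls next to the mean of the three determinations.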
The function $\varphi(\vartheta_1 \mid x_1, x_2, \ldots, x_n)$ is the basis of
the classical procedure of estimating $\theta_1$. Its interpretation is as
follows. We visualize a set of cases, to be described as “human experience”
and denoted by $H$, in which we shall be confronted by the problem of
estimating $\theta_1$. In the particular cases which form human experience,
the values $\vartheta_1, \vartheta_2, \ldots, \vartheta_s$ of the unknown
parameters $\theta_1, \theta_2, \ldots, \theta_s$ vary from case to case and
the function $\Psi_\theta(\vartheta_1, \vartheta_2, \ldots, \vartheta_s)$
characterizes the frequency distribution. For example, the relative frequency
of cases when $\theta_1$ will fall between any specified limits
$a < \theta_1 < b$ is obtainable from $\Psi_\theta$ by integrating it for
$\vartheta_1$ between $a$ and $b$ and for
$\vartheta_2, \vartheta_3, \ldots, \vartheta_s$ within the extreme limits of
their variation.

¹ Thomas Bayes: “An essay towards solving a problem in the doctrine of chances.”
Phil. Trans., London, Vol. 53 (1763), pp. 376-398, Vol. 54 (1764), pp. 298-310.

In parallel with changes in the values of the $\theta$’s, the particular
cases of human experience will differ in the values, say
$x_1, x_2, \ldots, x_n$, assumed by the observable random variables and this
variation is characterized by the probability density function
$p_{X|\theta}(x_1, x_2, \ldots, x_n \mid \vartheta_1, \vartheta_2, \ldots, \vartheta_s)$.
Now, within human experience $H$, isolate a part, say
$H(x_1, x_2, \ldots, x_n)$, in which the value of $X_1$ is $x_1$, the value
assumed by $X_2$ is $x_2$, etc. Naturally, within the series of cases
$H(x_1, x_2, \ldots, x_n)$, the values of the $\theta$’s will vary. The above
formulae (2) and (3) give the probability density functions relating to the
part $H(x_1, x_2, \ldots, x_n)$ of human experience, joint of all the
$\theta$’s and of $\theta_1$ alone, respectively.

Thus, the exact statement of the classical form of the problem of estimating
$\theta_1$ is as follows: We have observed $X_1 = x_1$, $X_2 = x_2$,
$\ldots$, $X_n = x_n$; therefore we appear in part
$H(x_1, x_2, \ldots, x_n)$ of human experience; what is the most probable
value, say $\theta_1(x_1, x_2, \ldots, x_n)$, of the parameter $\theta_1$?
The value $\theta_1(x_1, x_2, \ldots, x_n)$ required (called the a posteriori
most probable value of $\theta_1$ given $X_1 = x_1$, $X_2 = x_2$, $\ldots$,
$X_n = x_n$) is simply that value of $\vartheta_1$ for which
$\varphi(\vartheta_1 \mid x_1, x_2, \ldots, x_n)$ is a maximum.

The a posteriori most probable values of the estimated parameters have been 
used extensively as unique estimates since the time of Bayes. Also, once we 
place ourselves in the specified section $H(x_1, x_2, \ldots, x_n)$ of human
experience and limit our consideration to probabilities referring to this
section, there is no difficulty in treating the problem of estimation by an
interval. In fact, let
$\underline{\theta} = \underline{\theta}(x_1, x_2, \ldots, x_n)$ and
$\bar{\theta} = \bar{\theta}(x_1, x_2, \ldots, x_n)$ be two numbers which, for
a specified $\alpha$ between zero and unity, satisfy the condition

$$P\{\underline{\theta} \le \theta_1 \le \bar{\theta} \mid x_1, x_2, \ldots, x_n\} = \int_{\underline{\theta}}^{\bar{\theta}} \varphi(\vartheta_1 \mid x_1, x_2, \ldots, x_n)\, d\vartheta_1 = \alpha. \tag{4}$$

Obviously, there is an infinity of pairs of numbers satisfying this condition
and, if $\underline{\theta}$ and $\bar{\theta}$ are to be used to estimate
$\theta_1$, it is natural to require that in addition to satisfying (4), the
two estimates also minimize the difference

$$\bar{\theta}(x_1, x_2, \ldots, x_n) - \underline{\theta}(x_1, x_2, \ldots, x_n). \tag{5}$$

When the probability density $\varphi(\vartheta_1 \mid x_1, x_2, \ldots, x_n)$
is continuous, the problem of minimizing (5) subject to restriction (4) is
trivial and the solution provides the desired estimating interval. This
interval will be called the classical Bayes’ estimating interval. Its
properties are: (a) within the section $H(x_1, x_2, \ldots, x_n)$ of human
experience, the frequency of cases where the value of $\theta_1$ will be
within the interval $(\underline{\theta}, \bar{\theta})$ equals the number
$\alpha$ selected by the statistician himself, and (b) no interval shorter
than $(\underline{\theta}, \bar{\theta})$ having the property (a) is in
existence. These properties of the classical Bayes’ estimating interval
may be judged a sufficient justification for using it in practice. However,
it is important to be aware of other possibilities which are available when the
a priori distribution of the unknown parameters is known.
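The minimization of the length of the interval subject to the condition on its a posteriori probability can be carried out numerically when the density is tabulated on a grid. The sketch below assumes, beyond what the text states, that the density has a single maximum; it gathers grid cells in order of decreasing density until their probability reaches $\alpha$. The skew density used and the equal-tail interval shown for comparison are hypothetical illustrations.

```python
def shortest_interval(grid, density, step, alpha):
    """Shortest interval of probability at least alpha (single-peaked density)."""
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += density[i] * step
        if mass >= alpha:
            break
    return grid[min(chosen)] - step / 2, grid[max(chosen)] + step / 2

def equal_tail_interval(grid, density, step, alpha):
    """Interval leaving probability (1 - alpha)/2 outside on each side."""
    tail = (1.0 - alpha) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for t, d in zip(grid, density):
        cumulative += d * step
        if lower is None and cumulative >= tail:
            lower = t - step / 2
        if upper is None and cumulative >= 1.0 - tail:
            upper = t + step / 2
    return lower, upper

# Hypothetical skew a posteriori density on (0, 1), proportional to t^2 (1 - t)^8.
cells = 2000
step = 1.0 / cells
grid = [(i + 0.5) * step for i in range(cells)]
raw = [t ** 2 * (1.0 - t) ** 8 for t in grid]
total = sum(raw) * step
density = [r / total for r in raw]
alpha = 0.95
bayes = shortest_interval(grid, density, step, alpha)
rival = equal_tail_interval(grid, density, step, alpha)
print(f"shortest:   {bayes}, length {bayes[1] - bayes[0]:.4f}")
print(f"equal-tail: {rival}, length {rival[1] - rival[0]:.4f}")
```

Both intervals carry a posteriori probability of at least $\alpha$ on the grid, but only the first has property (b) in the sense that no rival of the same probability is shorter.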


THE MODERNIZED BAYES’ APPROACH 


In considering these alternative possibilities we should ask ourselves, Why 
should we refer the probabilities of success in estimation to the section 
$H(x_1, x_2, \ldots, x_n)$ of human experience? The point of this question is
that, even if one is professionally engaged in solving problems of estimation
as a matter of daily routine, it will only be most exceptionally that one will
be confronted with a set of observations, say $x_1', x_2', \ldots, x_n'$, which has
already been observed in the past. Normally, the whole experience of a 
statistician estimating a given parameter will be composed of cases in 
which the sets of observations are all different. Thus, this statistician’s 
life experience will consist of cases each extracted from a different section 
$H(x_1, x_2, \ldots, x_n)$ of human experience.

Let us illustrate this by an example. Consider a case in which a contract 
between a beet sugar factory and a group of beet growers provides for a 
varying price per ton of beets depending in some way upon the interval 
used to estimate the average sugar content in a carload of roots. In prin- 
ciple, the sweeter the beets, the higher the price. However, if the estimating 
interval is broad, a certain decrease in price is allowed due to uncertainty 
as to the actual sugar content. 

In order to determine the price, a sample of beets is drawn out of each 
carload and several independent determinations, say $X_1, X_2, \ldots, X_n$, of
the sugar content are made. These determinations are then used to compute
an interval estimating the average sugar content in each carload.
Ordinarily, it is assumed that the variables $X_1, X_2, \ldots, X_n$ are
independent and follow the normal law of frequency. On this assumption, the
probability of observing twice (i.e. for two carloads) the same system of
values of the $X$’s is equal to zero. Since the determinations are made only
with limited accuracy, strictly speaking, the variables
$X_1, X_2, \ldots, X_n$ are not of continuous type and the probability of
observing the same system of their values twice (or more) is not exactly equal
to zero. Nevertheless, the probability is extremely small and it is safe to
say that the experience of the factory will consist of cases where the
variables $X_1, X_2, \ldots, X_n$ assume a multitude of different systems of
values, with scarcely any repetitions. In these circumstances, it does not
seem reasonable to insist that the method of estimating the mean sugar content
insures that, within each section $H(x_1, x_2, \ldots, x_n)$ of human
experience, the probability of covering the true mean sugar content by the
estimating interval be exactly equal to the preassigned $\alpha$. On the
contrary, it may be presumed that both the farmers and the administration of
the factory will agree that the desirable method of computing the estimating
interval should insure (a) that the overall relative frequency [contrasted
with the relative frequencies relating to each section
$H(x_1, x_2, \ldots, x_n)$ separately] of successful estimation be equal to
the selected number $\alpha$ close to unity and (b) that, at least on
the average, the estimating intervals be as short as possible without
infringing condition (a).

It will be seen that here we come to a novel aspect of the problem of 
estimation. In order to arrive at its exact formulation, it was necessary 
to realize the fact that, whatever the method used, the outcome of the 
process of estimation based on some observable random variables has itself 
the property of being random. It is curious that this fact, noted by Laplace 
and Gauss, was later forgotten and did not reappear in the literature until 
in the 1930's. 

Upon reflecting on the various practical problems of estimation, it is 
easy to see that a great many of them resemble the situation implied by 
the contract between the beet growers and the sugar factory. However, 
there are examples in which the appropriate point of view on estimation 
seems to be the classical Bayes’ described above. Consider the following 
situation. 

Suppose that the observable random variables $X_1, X_2, \ldots, X_n$ are
something like the outcomes of an aptitude test taken by a young man pre-
paring to select a profession for himself. Suppose further that the apti- 
tude test measures exactly the attributes of the individual so that, while 
$X_1, X_2, \ldots, X_n$ vary from one individual to the next, they are constant for
each particular individual. Our final assumption is that the individual’s
success in the various available professions depends upon the parameter $\theta_1$
to be estimated. 

Now consider a particular individual, a Mr. John Frederick Smith, for 
whom it was found that $X_1 = x_1$, $X_2 = x_2$, $\ldots$, $X_n = x_n$. It is obvious that
Mr. John Frederick Smith’s point of view on estimation will be different 
from that of the administration of the beet sugar factory. The experience 
of the latter will involve many carloads of beet roots with varying mean 
sugar content and the important point is to insure overall high frequency
of successes in estimation combined with satisfactory precision. On the
other hand, the whole life of Mr. John Frederick Smith will be tied up
with just one section of the whole human experience, namely with the
section $H(x_1, x_2, \ldots, x_n)$. Therefore, if Mr. John Frederick Smith’s
actions are to be adjusted at all to outcomes of statistical estimation, it is
natural for him to insist on probabilities referring to
$H(x_1, x_2, \ldots, x_n)$ rather than
to the whole human experience. 

To illustrate this point more clearly, let me refer to phenomena of 
racial discrimination which still infest substantial parts of human society! 
Imagine that the variables $X_1, X_2, \ldots, X_n$ determine the race of Mr. John
Frederick Smith and that he is forced to live in a place where the general 
circumstances of life of individuals of one race are sharply different from 
those of another. It is obvious that, having established his racial identity, 
Mr. John Frederick Smith will be wise to build his own life in conformity 
with statistical data relating to particular races taken separately, rather 
than to the overall figures concerned with the total human experience. 

The case of John Frederick Smith illustrates, then, the general situation 
where the classical Bayes’ approach to the problem of estimation appears 
reasonable. The existence of such cases, however, should not blind us, as 
it did for over a century, to the great mass of other cases in which the 
restrictiveness of the classical approach can be usefully relaxed. Many 
important results in this direction, primarily concerned with point estima- 
tion, are due to Wald, Wolfowitz, Girshick and others. We will consider the 
following problem. 

Let $X_1, X_2, \ldots, X_n$ denote a set of observable random variables and let
$\theta_1$ be the parameter to be estimated. With each system
$x_1, x_2, \ldots, x_n$ of possible values of the $X$’s we shall connect a set
$\theta(x_1, x_2, \ldots, x_n)$ of possible values of $\theta_1$ to be used
for estimating $\theta_1$. Whenever the observations yield $X_1 = x_1'$,
$X_2 = x_2'$, $\ldots$, $X_n = x_n'$, we shall substitute the observed values
into the function $\theta(x_1, x_2, \ldots, x_n)$ and assert that the unknown
$\theta_1$ is one of the numbers included in
$\theta(x_1', x_2', \ldots, x_n')$. Let $\alpha$ be a fixed number,
$0 < \alpha < 1$, close to unity.

We shall say that the set $\theta(x_1, x_2, \ldots, x_n)$ is the modernized
Bayes’ estimating set (MB for short) corresponding to the confidence
coefficient $\alpha$ if it satisfies the following two conditions:


(1) The relative frequency of cases within the whole human experience
(i.e., the probability) where the set $\theta(X_1, X_2, \ldots, X_n)$ will
cover the true value of $\theta_1$ is equal to $\alpha$;

(2) Of all sets satisfying condition (1) the set
$\theta(X_1, X_2, \ldots, X_n)$ has the smallest expected Lebesgue measure.

In order to deal with the general case, it is convenient to speak of
an estimating set. However, if the reader tries to apply the following
theory, he is most likely to find that the modernized Bayes’ estimating set
reduces to an interval bounded by two functions, say
$\underline{\theta}(X_1, X_2, \ldots, X_n) \le \bar{\theta}(X_1, X_2, \ldots, X_n)$.

In order to solve the problem of the MB set, it is sufficient to express 
the two conditions (1) and (2) in formulae and to apply an easy lemma, 
occasionally described as the Fundamental Lemma in the theory of optimum 
tests. Applied to the present case, the lemma asserts the following. 


(a) If $F_1(t_1, t_2, \ldots, t_n)$ and $F_2(t_1, t_2, \ldots, t_n)$ are any
two functions integrable over any measurable set of systems of values of the
arguments $t_1, t_2, \ldots, t_n$;

(b) If $w_0$ is a set of systems of values of $t_1, t_2, \ldots, t_n$ which
contains all systems $(t_1, t_2, \ldots, t_n)$ where

$$F_1(t_1, t_2, \ldots, t_n) < a F_2(t_1, t_2, \ldots, t_n)$$

and none of those where

$$F_1(t_1, t_2, \ldots, t_n) \ge a F_2(t_1, t_2, \ldots, t_n);$$

(c) If $w$ is a measurable set of values of $t_1, t_2, \ldots, t_n$ such that

$$\int \cdots \int_{w} F_2 \, dt_1\, dt_2 \cdots dt_n = \int \cdots \int_{w_0} F_2 \, dt_1\, dt_2 \cdots dt_n;$$

then

$$\int \cdots \int_{w_0} F_1 \, dt_1\, dt_2 \cdots dt_n \le \int \cdots \int_{w} F_1 \, dt_1\, dt_2 \cdots dt_n.$$

In other words, of all sets $w$ for which the integral of $F_2$ has the same
value, the set $w_0$ ascribes to the integral of $F_1$ the smallest possible
value.

Now let us return to the search for MB sets. For this purpose, consider
the systems of possible simultaneous values $(\vartheta, x_1, x_2, \ldots, x_n)$ of
the estimated parameter $\theta_1$ and of the observable random variables
$X_1, X_2, \ldots, X_n$. In order to visualize these systems, it will be convenient
to consider a space $S$ of $n + 1$ dimensions, with axes of coordinates of
$x_1, x_2, \ldots, x_n$ and $\vartheta$. The totality of MB sets can be interpreted
in the space $S$ as a region (or a set) $w_0$ containing all points with arbitrary
coordinates $x_1, x_2, \ldots, x_n$ and with coordinate $\vartheta$ belonging to
$\theta(x_1, x_2, \ldots, x_n)$. As usual, $W$ will stand for the whole sample space.

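With this interpretation, condition (1) in the definition of MB sets can be
expressed by equating to $\alpha$ the integral over $w_0$ of the joint probability
density function of $\theta_1$ and $X_1, X_2, \ldots, X_n$. Thus,

$$\begin{aligned}
&\int \cdots \int_{W} p_X(x_1, \ldots, x_n) \int_{\theta(x_1, \ldots, x_n)} \varphi(\vartheta \mid x_1, \ldots, x_n) \, d\vartheta \, dx_1 \cdots dx_n \\
&\quad = \int \cdots \int_{w_0} p_X(x_1, \ldots, x_n) \, \varphi(\vartheta \mid x_1, \ldots, x_n) \, d\vartheta \, dx_1 \cdots dx_n = \alpha.
\end{aligned} \tag{6}$$

Similarly, the second condition defining MB sets is expressed by the formula

$$\begin{aligned}
&\int \cdots \int_{W} p_X(x_1, \ldots, x_n) \int_{\theta(x_1, \ldots, x_n)} d\vartheta \, dx_1 \, dx_2 \cdots dx_n \\
&\quad = \int \cdots \int_{w_0} p_X(x_1, \ldots, x_n) \, d\vartheta \, dx_1 \, dx_2 \cdots dx_n = \text{minimum}.
\end{aligned}$$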


This last formula is due to the fact that the measure of the set
$\theta(x_1, x_2, \ldots, x_n)$ is equal to the integral of unity extended over the
set. The application of the Fundamental Lemma leads to the conclusion that, in
order to determine $w_0$, it is sufficient to find a constant $a$ and a region
$w_0$ including all points where

$$p_X(x_1, \ldots, x_n) < a \, p_X(x_1, \ldots, x_n) \, \varphi(\vartheta \mid x_1, \ldots, x_n)$$

and none of those where

$$p_X(x_1, \ldots, x_n) > a \, p_X(x_1, \ldots, x_n) \, \varphi(\vartheta \mid x_1, \ldots, x_n)$$

and such that

$$\int \cdots \int_{w_0} p_X(x_1, \ldots, x_n) \, \varphi(\vartheta \mid x_1, \ldots, x_n) \, d\vartheta \, dx_1 \cdots dx_n = \alpha.$$


Since the probability density $p_X$ is never negative and since we can ignore
points where it is zero, it is seen that the region $w_0$ is defined by the
condition

$$\varphi(\vartheta \mid x_1, x_2, \ldots, x_n) \ge a, \tag{7}$$

where $a$ is an appropriate constant (the reciprocal of the constant that
multiplies $p_X \varphi$ in the preceding inequalities). Further on we shall
illustrate the procedure in a practical example. It consists in writing down
the a posteriori probability density function
$\varphi(\vartheta \mid x_1, x_2, \ldots, x_n)$ of the estimated parameter and in
substituting it into formula (7). This formula must then be solved with respect
to $\vartheta$. The solution, in the form of one or more inequalities (combined
with equalities) imposed on $\vartheta$, determines the set
$\theta(x_1, x_2, \ldots, x_n)$. Obviously, this solution will depend on the chosen
$a$. The value of this constant is adjusted to satisfy condition (6).

In order to illustrate the various concepts discussed we shall now consider
some examples. First we shall adopt the classical Bayes’ point of view and
illustrate how the a posteriori most probable value of a parameter and the
classical Bayes’ estimating interval depend on the a priori distribution of this
parameter. Next, on a slightly different example, we will illustrate the rela- 
tionship between the classical and the modernized Bayes’ estimating intervals. 
In both cases we shall be interested in the conceptual rather than in the 
practical numerical side of the problem. For this reason, the examples are 
especially selected so as not to involve technical complications, cumbersome 
integrals, etc. 

We are going to consider $n$ observable random variables $X_1, X_2, \ldots, X_n$,
all independent and each known to be uniformly distributed between zero and
a positive number $\theta$. We shall assume that our knowledge of this number
$\theta$ is limited to the double relation, $0 < \theta \le 1$, and we shall
consider the problem of estimating $\theta$. Thus, in this example, the joint
probability density function of all the observable random variables depends on
only one unknown parameter, namely $\theta$, and is given by the formula

$$p_X(x_1, x_2, \ldots, x_n \mid \theta) =
\begin{cases}
\dfrac{1}{\theta^n} & \text{for } 0 \le x_1, x_2, \ldots, x_n \le \theta, \\
0 & \text{elsewhere.}
\end{cases} \tag{8}$$

In order to illustrate the use of Bayes’ formula, we shall assume that $\theta$
itself is a random variable with the probability density function of the simple
form,

$$\psi(\theta) =
\begin{cases}
m \theta^{m-1} & \text{for } 0 < \theta \le 1, \\
0 & \text{elsewhere.}
\end{cases} \tag{9}$$


Here $m$ represents a positive number. Let the letter $x$ without any subscript
denote the greatest of the numbers $x_1, x_2, \ldots, x_n$ which may be given
by observation as particular values of the variables $X_1, X_2, \ldots, X_n$. The
capital letter $X$ without any subscript will denote the random variable defined
as the greatest of the $X_1, X_2, \ldots, X_n$. Thus, $x$ is a particular value of
$X$ which may be given by observation. The definition of the random variables
$X_1, X_2, \ldots, X_n$ implies that $0 \le X \le \theta$ so that, if the
observations have determined a value $x$ of $X$, then $\theta \ge x$.
Substituting (8) and (9) into (2), we obtain the a posteriori probability
density of $\theta$, say

$$\varphi(\theta \mid x_1, x_2, \ldots, x_n) =
\begin{cases}
\dfrac{\theta^{m-n-1}}{\int_x^1 \vartheta^{m-n-1} \, d\vartheta} & \text{for } 0 < x \le \theta \le 1, \\
0 & \text{elsewhere.}
\end{cases} \tag{10}$$


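If $m \ne n$, then this formula gives

$$\varphi(\theta \mid x_1, x_2, \ldots, x_n) =
\begin{cases}
\dfrac{(m - n)\, \theta^{m-n-1}}{1 - x^{m-n}} & \text{for } 0 < x \le \theta \le 1, \\
0 & \text{elsewhere.}
\end{cases} \tag{11}$$

Otherwise, if $m = n$, then

$$\varphi(\theta \mid x_1, x_2, \ldots, x_n) =
\begin{cases}
-\dfrac{1}{\theta \log x} & \text{for } 0 < x \le \theta \le 1, \\
0 & \text{elsewhere.}
\end{cases} \tag{12}$$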


It follows that the a posteriori distribution of $\theta$, given
$x_1, x_2, \ldots, x_n$, depends effectively only on the greatest $x$ of the
$x_1, x_2, \ldots, x_n$. Given the value of $x$, the most probable value of
$\theta$ depends on the relation between $m$ and $n$. If $m = n + 1$, then
$\varphi(\theta \mid x_1, x_2, \ldots, x_n)$ is constant within the interval $(x, 1)$,

$$\varphi(\theta \mid x_1, x_2, \ldots, x_n) =
\begin{cases}
\dfrac{1}{1 - x} & \text{for } x \le \theta \le 1, \\
0 & \text{elsewhere.}
\end{cases}$$

Thus, in this particular case, all the numbers of the interval $(x, 1)$ are the
a posteriori most probable values of $\theta$. Also, in this case, any value may
be ascribed to the lower boundary $\underline{\theta}$, subject to the restriction,

$$x \le \underline{\theta} \le 1 - \alpha(1 - x),$$

and then the corresponding value of the upper boundary $\bar{\theta}$ will be

$$\bar{\theta} = \underline{\theta} + \alpha(1 - x) \le 1.$$

Hence, we have at our disposal an infinity of pairs of estimates varying
between the extremes, say

$$\underline{\theta}' = x, \qquad \bar{\theta}' = (1 - \alpha)x + \alpha,$$

and

$$\underline{\theta}'' = \alpha x + 1 - \alpha, \qquad \bar{\theta}'' = 1.$$

Any pair of estimates such as these may be used and the consequences will
be the same: the probability of the statement regarding the true value of
$\theta$ which is in the form $\underline{\theta} \le \theta \le \bar{\theta}$ is
equal to the preassigned number $\alpha$. Also, whatever be the choice of the
pair of estimates within the set indicated, the precision of the assertion
regarding $\theta$ will always be the same because, for all the estimates
considered, the difference $\bar{\theta} - \underline{\theta}$ has the same value,
namely, $\alpha(1 - x)$.

No such arbitrariness of choice exists when $m \ne n + 1$. If $m < n + 1$,
then the a posteriori probability density of $\theta$ decreases as $\theta$
varies from $x$ to 1 and the most probable value of $\theta$ is $x$. Also, the
left boundary $\underline{\theta}$ of the classical Bayes’ estimating interval is
$\underline{\theta} = x$. In order to obtain the right boundary $\bar{\theta}$,
we have to solve the equation,

$$\int_x^{\bar{\theta}} \varphi(\theta \mid x_1, x_2, \ldots, x_n) \, d\theta = \alpha.$$

If $m \ne n$, then this equation gives

$$\bar{\theta} = [(1 - \alpha) x^{m-n} + \alpha]^{1/(m-n)}.$$

Otherwise, if $m = n$, then $\bar{\theta} = x^{1-\alpha}$.

If $m > n + 1$, then the situation is reversed, the a posteriori probability
density function of $\theta$ increases with the increase of $\theta$ from $x$ to
1, and the most probable value of $\theta$ is equal to unity, irrespective of the
observed value of $x$. In this case, $\bar{\theta} = 1$ and $\underline{\theta}$
satisfies the equation,

$$\int_{\underline{\theta}}^{1} \varphi(\theta \mid x_1, x_2, \ldots, x_n) \, d\theta = \alpha,$$

which reduces to

$$\underline{\theta} = (\alpha x^{m-n} + 1 - \alpha)^{1/(m-n)}.$$


The purpose of the above discussion is to show that both the single estimate,
represented by the a posteriori most probable value, and the classical
Bayes’ estimating interval may depend very strongly on the a priori
distribution of the estimated parameter $\theta$. In order to emphasize this
circumstance, all the results obtained are collected in tabular form.


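Estimates of $\theta$ in relation to $m$

$m$                 Most probable value     Lower boundary $\underline{\theta}(x)$              Upper boundary $\bar{\theta}(x)$
$m < n + 1$
  (a) $m \ne n$     $x$                     $x$                                                 $[(1 - \alpha) x^{m-n} + \alpha]^{1/(m-n)}$
  (b) $m = n$       $x$                     $x$                                                 $x^{1-\alpha}$
$m = n + 1$         $x \le \theta \le 1$    $x \le \underline{\theta} \le 1 - \alpha(1 - x)$    $\underline{\theta} + \alpha(1 - x)$
$m > n + 1$         $1$                     $[\alpha x^{m-n} + (1 - \alpha)]^{1/(m-n)}$         $1$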
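
The entries of the table can be checked numerically. The Python sketch below is
an illustration only: the function names `posterior_cdf` and
`classical_interval` and the observation x = 0.3 are assumptions of this sketch,
not part of the text, and for the case m = n + 1, where infinitely many
intervals are equally good, the code arbitrarily returns the leftmost one. It
computes the boundaries listed in the table and confirms, by integrating the a
posteriori density (11) or (12), that each interval carries probability alpha.

```python
import math


def posterior_cdf(theta, x, m, n):
    """A posteriori probability that the parameter is <= theta, given the
    greatest observation x (0 < x < 1); integral of (11) or (12) from x."""
    if theta <= x:
        return 0.0
    if theta >= 1.0:
        return 1.0
    k = m - n
    if k == 0:
        return 1.0 - math.log(theta) / math.log(x)
    return (theta**k - x**k) / (1.0 - x**k)


def classical_interval(x, m, n, alpha):
    """Boundaries (lower, upper) of the classical Bayes' interval from the table."""
    k = m - n
    if m < n + 1:
        # density decreases on (x, 1): the interval starts at x
        if k == 0:
            upper = x ** (1.0 - alpha)
        else:
            upper = ((1.0 - alpha) * x**k + alpha) ** (1.0 / k)
        return x, upper
    if m > n + 1:
        # density increases on (x, 1): the interval ends at unity
        return (alpha * x**k + 1.0 - alpha) ** (1.0 / k), 1.0
    # m == n + 1: flat density; arbitrary choice of the leftmost interval
    return x, x + alpha * (1.0 - x)


if __name__ == "__main__":
    n, alpha, x = 4, 0.8, 0.3  # x is an arbitrary illustrative observation
    for m in (0.5, 4.5, 5.5):  # the three values of m used in Figures 1 to 4
        lower, upper = classical_interval(x, m, n, alpha)
        mass = posterior_cdf(upper, x, m, n) - posterior_cdf(lower, x, m, n)
        print(f"m={m}: [{lower:.4f}, {upper:.4f}], a posteriori probability {mass:.4f}")
```

Running the loop for the three values of m shows how strongly the interval
moves with the a priori distribution while the a posteriori probability stays
at alpha.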


Figures 1 through 4 illustrate the situation which corresponds to a fixed
value of $n$, $n = 4$, and to three different values of $m$, $m = 4.5$, $5.5$
and $.5$, respectively. It is seen that for any given $x$ the most probable
value of $\theta$ and also the classical Bayes’ estimating interval depend very
much upon the a priori distribution of this parameter. According to the
properties of $\psi(\theta)$, the most probable value of $\theta$ may be $x$
itself or unity. The estimating interval may begin at $x$ or end with unity; it
may be wide or narrow. The interesting point is that a very substantial change
in estimates of $\theta$ occurs when the a priori distribution $\psi(\theta)$
changes very moderately, say from $\psi_1(\theta)$ to $\psi_2(\theta)$.

FIGURE 1
A priori distributions of $\theta$

FIGURE 2
Classical Bayes’ estimating intervals corresponding to $\psi_1(\theta)$
$m_1 = 4.5$; $n = 4$; $\alpha = .8$

Situations of this kind are quite common and were noticed long ago.
When the a priori distribution of the estimated parameter is known exactly,
there is no difficulty involved. Frequently, however, the a priori
distribution of the estimated parameter is not known and an effort to use the
classical Bayes’ approach is combined with the use of a more or less
arbitrarily selected function which, it is hoped, approximates the a priori
probability density. In such cases there may be difficulties.

FIGURE 3
Classical Bayes’ estimating intervals corresponding to $\psi_2(\theta)$
$m_2 = 5.5$; $n = 4$; $\alpha = .8$

Now we shall illustrate the relationship between the classical and the
modernized Bayes’ approach. We shall use the same problem of estimating
$\theta$ described above, but, in order to simplify the algebra, we shall
substitute $m = 2$ and $n = 3$. Repeating the discussion of the preceding pages,
we find easily that the classical Bayes’ estimating interval is given by

$$\underline{\theta}(x) = x \quad \text{and} \quad \bar{\theta}(x) = \frac{x}{1 - \alpha(1 - x)} \quad \text{for } 0 < x \le 1. \tag{13}$$

The length of this interval, say $B(x)$, is

$$B(x) = \frac{x}{1 - \alpha(1 - x)} - x, \quad \text{for } 0 < x \le 1.$$

FIGURE 4
Classical Bayes’ estimating intervals corresponding to $\psi_3(\theta)$

Further, easy calculations indicate that the joint probability density of
$\theta$ and $X$ is

$$p_{\theta, X}(\vartheta, x) = \frac{6x^2}{\vartheta^2} \quad \text{for } 0 < x \le \vartheta \le 1.$$

Integrating this expression with respect to $\vartheta$ between the limits
$x \le \vartheta \le 1$ we obtain the absolute probability density of $X$ alone,

$$p_X(x) = 6x(1 - x).$$

It follows that the most frequent value of $x$ is one half. Finally, the
a posteriori probability density of $\theta$ is

$$p_{\theta \mid x}(\vartheta) = \frac{p_{\theta, X}(\vartheta, x)}{p_X(x)} = \frac{x}{(1 - x)\vartheta^2} \quad \text{for } 0 < x < 1 \text{ and } x \le \vartheta \le 1.$$
Turning to formula (7), we see that the MB estimating set is determined by
the formula,

$$\frac{1}{\vartheta^2} \cdot \frac{x}{1 - x} \ge a. \tag{14}$$

We can simplify the further writing somewhat if we substitute $1/t^2$ for $a$.
Then formula (14) will reduce to

$$\vartheta \le t \left[ \frac{x}{1 - x} \right]^{1/2}. \tag{15}$$

Here $a$ and/or $t$ must be adjusted to the value of $\alpha$. It will be
remembered that conditions of the problem imply that $0 < \vartheta \le 1$. Thus
inequality (15) implies a real limitation on the value of $\vartheta$ only if

$$t \left[ \frac{x}{1 - x} \right]^{1/2} < 1 \quad \text{or} \quad x < \frac{1}{1 + t^2}.$$

Furthermore, (15) is compatible with the necessary condition $x \le \vartheta$
only when

$$t \left[ \frac{x}{1 - x} \right]^{1/2} \ge x,$$

or if

$$t^2 \ge x(1 - x),$$

or

$$x^2 - x + t^2 \ge 0.$$

This condition will always be satisfied when the roots of the quadratic are
complex. This will happen if $t > \tfrac{1}{2}$. Otherwise (15) will be used
only for values of $x$ outside of the interval between the two roots of the
quadratic $x^2 - x + t^2$. On the first assumption, namely, that the value of
$t$ corresponding to the selected $\alpha$ exceeds one half, the modernized
Bayes’ estimating interval for $\theta$ is, say

$$\theta'(x) = x \quad \text{and} \quad \theta''(x) =
\begin{cases}
t \left[ \dfrac{x}{1 - x} \right]^{1/2} & \text{for } 0 < x \le \dfrac{1}{1 + t^2}, \\
1 & \text{for } \dfrac{1}{1 + t^2} < x \le 1.
\end{cases} \tag{16}$$

The length of the interval is, say

$$M(x) = \theta''(x) - \theta'(x).$$

Now, let us see how to adjust $t$ to the selected $\alpha$. For this purpose we
write the expression for the probability that the random variables $X$ and
$\theta$ will satisfy the double relation,

$$\begin{aligned}
P\{\theta'(X) \le \theta \le \theta''(X)\}
&= \int_0^1 p_X(x) \int_{\theta'(x)}^{\theta''(x)} p_{\theta \mid x}(\vartheta) \, d\vartheta \, dx \\
&= \int_0^{1/(1+t^2)} p_X(x) \int_x^{t[x/(1-x)]^{1/2}} p_{\theta \mid x}(\vartheta) \, d\vartheta \, dx
 + \int_{1/(1+t^2)}^{1} p_X(x) \int_x^1 p_{\theta \mid x}(\vartheta) \, d\vartheta \, dx.
\end{aligned}$$

It will be seen that this probability is an increasing function of $t$. Upon
substituting the expressions of $p_X(x)$ and of $p_{\theta \mid x}(\vartheta)$
given above and upon performing the integration, we find that

$$P\{\theta'(X) \le \theta \le \theta''(X)\} = \frac{4t^4 + 11t^2 + 9}{4(1 + t^2)^2} - \frac{3}{8t} \left( \frac{\pi}{2} + \arcsin \frac{1 - t^2}{1 + t^2} \right) \quad \text{for } \tfrac{1}{2} \le t. \tag{17}$$
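
The adjustment of t to the selected alpha can be sketched in a few lines of
Python. The closed form coded below is this sketch's reading of formula (17);
it reproduces the value P = .2593 at t = 1/2 quoted in the text. The names
`coverage` and `solve_t`, the bracketing interval and the tolerance are
assumptions made for illustration, and bisection stands in for the "easy
interpolation" mentioned in the text.

```python
import math


def coverage(t):
    """Formula (17): probability that the MB interval covers theta (t >= 1/2)."""
    s = 1.0 + t * t
    rational = (4.0 * t**4 + 11.0 * t**2 + 9.0) / (4.0 * s * s)
    angle = math.pi / 2.0 + math.asin((1.0 - t * t) / s)
    return rational - 3.0 / (8.0 * t) * angle


def solve_t(alpha, lo=0.5, hi=50.0, tol=1e-10):
    """Bisection for coverage(t) = alpha, relying on coverage increasing in t."""
    if not coverage(lo) <= alpha <= coverage(hi):
        raise ValueError("alpha is outside the range reachable with lo <= t <= hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(f"P at t = 1/2: {coverage(0.5):.4f}")
    print(f"t for alpha = 0.8: {solve_t(0.8):.5f}")
```

Because the formula holds only for t at least one half, the search starts at
t = 1/2 and refuses values of alpha below the coverage attained there.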


The requisite value of $t$ can be found by equating this expression to $\alpha$
and by solving with respect to $t$. First, however, we must assure ourselves
that this value exceeds one half. For this purpose we first decide on
$\alpha = 0.8$. Upon substituting $t = \tfrac{1}{2}$ into (17), we obtain the
value $P = .2593$. Hence the requisite value of $t$ must be greater and formula
(17) may be used to establish it. Easy interpolation gives $t = .79614$. This
completes the determination of the modernized Bayes’ estimating interval for
$\theta$. Figure 5 presents both the classical Bayes’ estimating interval
corresponding to formula (13) with $\alpha = 0.8$ and the modernized,
corresponding to (16) with $t = .79614$. If he uses either, the statistician may
be sure that he will be correct in about 80 percent of the cases. Using the
classical solution (13) he may also be certain that, should it be possible to
isolate enough cases of the general human experience in which $X$ has the same
value $x$, the relative frequency of successes in this section $H(x)$ of human
experience would also be equal to $\alpha$. In addition, no shorter interval
having this property can be found. If he uses the modernized solution, no
general statement regarding any particular section $H(x)$ can be made. The
inspection of the graphs on Figure 5 indicates that the MB intervals are
sometimes shorter and sometimes longer than the classical ones. In using the MB
intervals, the statistician may be certain that within the whole human
experience he will also be right in about 80 percent of all cases. Furthermore
he may be certain that no other estimating intervals corresponding to the same
$\alpha$ will have their long range average length shorter than the intervals
computed from (16). Whatever the interval, say $\theta_1(x) \le \theta_2(x)$,
the expectation of its length is computed from the formula,

$$E[\theta_2(X) - \theta_1(X)] = \int_0^1 [\theta_2(x) - \theta_1(x)] \, p_X(x) \, dx.$$

FIGURE 5
Comparison of the classical and the modernized Bayes’ estimating intervals

Upon substituting into this formula alternatively the expressions (13) and
(16), it is found

$$E[B(X)] = \int_0^1 \left[ \frac{x}{1 - \alpha(1 - x)} - x \right] 6x(1 - x) \, dx = .2868 \quad (\alpha = .8),$$

$$E[M(X)] = \frac{3t}{8} \left( \frac{\pi}{2} + \arcsin \frac{1 - t^2}{1 + t^2} \right) - \frac{t^4 + t^2 + 2}{4(1 + t^2)^2} = .2522 \quad (\alpha = .8, \; t = .79614).$$
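
The two expected lengths can be reproduced by direct numerical integration. The
following Python sketch assumes B(x) from formula (13), M(x) from formula (16)
and p_X(x) = 6x(1 - x); the helper names and the midpoint rule with 200,000
steps are choices made here for illustration, not part of the text.

```python
import math

ALPHA = 0.8
T = 0.79614  # value of t quoted in the text for alpha = 0.8


def p_x(x):
    """Absolute probability density of X, the greatest observation (m = 2, n = 3)."""
    return 6.0 * x * (1.0 - x)


def length_classical(x, alpha=ALPHA):
    """B(x): length of the classical Bayes' interval (13)."""
    return x / (1.0 - alpha * (1.0 - x)) - x


def length_mb(x, t=T):
    """M(x): length of the modernized Bayes' interval (16), upper end capped at 1."""
    upper = min(1.0, t * math.sqrt(x / (1.0 - x)))
    return upper - x


def expected_length(length, steps=200_000):
    """Midpoint-rule approximation of the integral of length(x) * p_X(x) over (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += length(x) * p_x(x)
    return total * h


if __name__ == "__main__":
    e_b = expected_length(length_classical)
    e_m = expected_length(length_mb)
    print(f"E[B(X)] = {e_b:.4f}, E[M(X)] = {e_m:.4f}")
    print(f"relative reduction = {(e_b - e_m) / e_b:.1%}")
```

The midpoint rule never evaluates the integrand at x = 0 or x = 1, which avoids
the division by zero in x/(1 - x) at the right end point.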

It is seen that, by sacrificing the requirement underlying the classical
approach that the probability of success in estimation be equal to
$\alpha = 0.8$ separately within each section $H(x)$ of human experience, and by
requiring that this probability equal $\alpha$ within the whole human
experience, the average length of the estimating interval can be reduced
roughly by 12 percent of its original value. Naturally, this particular
percentage is characteristic of the particular problem considered. The
important point to remember is that, outside of the category of cases
exemplified by the above situation of Mr. John Frederick Smith, the modernized
Bayes’ estimating intervals will ordinarily be, on the average, shorter than
the classical ones corresponding to the same frequency of successful
estimation.

As we have already mentioned, the problem of the modernized Bayes’
estimating intervals is akin to the theory treated by Wald, Wolfowitz,
Girshick and others. However, the theory that is treated by these authors is
much deeper and refers to the case in which the a priori distribution of the
estimated parameter is uncertain. Also this theory is mainly concerned
with point estimation.²


DIFFICULTIES CAUSED BY UNCERTAINTY REGARDING THE A PRIORI DISTRIBUTION 
AND ATTEMPTS TO CIRCUMVENT THEM 


As we have already mentioned, all the above discussion applies to cases 
where the a priori distribution of the estimated parameter is exactly known. 
This distribution must be implied by the conditions of the particular prob- 
lem under consideration. Cases of this kind exist, particularly in genetics 
where the postulate of the Mendelian Law implies everything, the random- 
ness of the observable variables, the class to which their distribution belongs, 
the randomness of the estimated parameters and their a priori distribution.

Unfortunately, situations of this nature are extremely rare and, in prob- 
lems of a more common type, various difficulties arise. The randomness 
of the estimated parameter requires a postulate entirely independent of 
the one which concerns the observable random variables. Moreover, on 
occasion one feels reluctant to admit that the parameters are random
variables. Finally, even if the randomness of the parameters is postulated,
their distribution is unknown so that the conditions of the practical problem
considered do not include information about the nature of the function
$\psi(\theta_1, \theta_2, \ldots, \theta_s)$ which plays an important role in
formula (2). Strictly speaking, then, in cases of the kind described, the
formulae discussed above, giving the most probable value of the parameter and
the classical or the modernized Bayes’ estimating interval, are not applicable
because of lack of necessary data.

2 For illustrations of the problems currently treated, see the article by J. L. Hodges,
Jr., and E. L. Lehmann: “Some problems in minimax point estimation,” Annals of Math.
Stat., Vol. 21 (1950), pp. 182-197.

This embarrassing circumstance was noticed quite some time ago and 
there have been various attempts made to overcome the difficulty. Most 
of these attempts have the same theoretical weakness: they are not solutions 
of a mathematical problem using the data which are directly implied by 
the practical problem considered; instead, they are excuses or alibis for 
applying the attractive formula (2) even though the conditions of the 
problem studied do not provide the necessary data to substitute in formula 
(2). Naturally, these theoretical weaknesses are accompanied by corre- 
sponding practical defects. 

The first attempt to obviate the difficulty caused by the lack of infor- 
mation regarding the probabilities a priori consisted in the formulation of 
the so-called “principle of insufficient reason.” Roughly, this principle 
asserts that, whenever there is no good reason to believe that some par- 
ticular possible values of the estimated parameter are more probable than 
others, then it is legitimate to substitute in formula (2) 


$$\psi(\theta_1, \theta_2, \ldots, \theta_s) = C = \text{constant}.$$

There are no laws, as yet, prohibiting the calculation of any formulae, and
I would be the last to suggest that such laws should be introduced. Thus, I
have not the slightest intention of questioning the legitimacy of the
substitution suggested. On the other hand, I wish to point out that, in cases
where the conditions of the actual problem do not imply
$\psi(\theta_1, \theta_2, \ldots, \theta_s) = C$, and where the substitution of
$C$ instead of $\psi(\theta_1, \theta_2, \ldots, \theta_s)$ is made on the basis
of the principle alone, the results of further calculations using formula (2)
need not have the clear frequency interpretation discussed above. In
particular, the most probable value of the parameter computed using the
principle of insufficient reason need not coincide with the value of $\theta$
which is most frequent in the sequence of cases $H(x)$. Furthermore, the
estimating interval computed using the principle of insufficient reason need
not contain the true value of $\theta$ in the stated proportion $\alpha$ of the
sequence of cases $H(x)$.

This point is well illustrated in the example described above. Following
the principle of insufficient reason, we should put $m = 1$ which, with
$n = 4$, would lead to the conclusion that the a posteriori most probable value
of $\theta$ is $x$, the greatest of the four values of the observable random
variables given by observation. However, if it happens that the true a priori
distribution of $\theta$ is the function $m\theta^{m-1}$ with $m > 5$, then,
irrespective of the observed value of $x$, the value most frequently assumed by
$\theta$ in any sequence $H(x)$ will be unity. Furthermore, the presumed “most
probable” value equal to $x$ will be the least frequent value of $\theta$. A
similar disappointment would result from the application of the Bayes’
estimating interval based on the principle of insufficient reason.
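
The disappointment with the estimating interval can be made visible by
simulation. The Python sketch below is illustrative only: the function name,
the number of trials, the seed and the true value m = 5.5 are assumptions. It
also measures the overall frequency of success rather than the frequency
within a single section H(x) discussed in the text, a related but different
quantity. The interval is computed as if m = 1, while the parameter is actually
drawn from the a priori density m θ^(m-1).

```python
import random


def coverage_under_wrong_prior(true_m, assumed_m=1.0, n=4, alpha=0.8,
                               trials=100_000, seed=0):
    """Fraction of simulated cases in which the classical Bayes' interval built
    from the assumed prior covers theta drawn from the true prior."""
    k = assumed_m - n
    if not (assumed_m < n + 1 and k != 0):
        raise ValueError("sketch handles only assumed_m < n + 1 with assumed_m != n")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # 1 - random() lies in (0, 1], which keeps theta and x strictly positive
        theta = (1.0 - rng.random()) ** (1.0 / true_m)   # inverse-CDF draw, CDF = theta^m
        x = theta * max(1.0 - rng.random() for _ in range(n))  # greatest of n uniforms on (0, theta]
        upper = ((1.0 - alpha) * x**k + alpha) ** (1.0 / k)     # table row m < n + 1, m != n
        hits += x <= theta <= upper
    return hits / trials


if __name__ == "__main__":
    print("assumed prior is the true one:", coverage_under_wrong_prior(true_m=1.0))
    print("true prior has m = 5.5:       ", coverage_under_wrong_prior(true_m=5.5))
```

When the assumed and the true a priori distributions coincide, the observed
frequency stays close to alpha; with the true m = 5.5 it falls well below it.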

In addition to the above disadvantages, the principle of insufficient
reason is difficult to apply when the set of the possible values of the
estimated parameter is unbounded, for example when the parameter $\theta$ is
capable of assuming any positive value $0 < \theta < \infty$, or any real value
$-\infty < \theta < +\infty$. In cases of this kind the probability density
function cannot be represented by a constant because of the restriction that
the integral of the probability density function extended from $-\infty$ to
$+\infty$ must be equal to unity. Strange as it may seem, some of the
protagonists of the subjective theory of probability who adhere to the
principle of insufficient reason are not disturbed by this fact.

In this connection a modification of the principle of insufficient reason
should be mentioned. According to the new principle, formula (2) may
be legitimately computed by substituting for the unknown
$\psi(\theta_1, \theta_2, \ldots, \theta_s)$ some function, not necessarily a
constant, but a special function invented for this particular purpose and
representative of “the state of mind” of the statistician who lacks any
knowledge of what the values of the parameters
$\theta_1, \theta_2, \ldots, \theta_s$ might be.

Thus, for example, when one deals with normally distributed variables
having an unknown variance $\sigma^2$, in the absence of any definite
information as to what the a priori distribution might be, it is suggested that
one use the formula, say

$$\psi_1(\sigma) = \frac{c}{\sigma},$$

where $c$ is a constant.

The reason for suggesting this particular function seems to be the
following. Let $t$ be a positive number. If we try to answer the question of
the relation between the probability of $\sigma < t$ and the probability of
$\sigma > t$, the suggested form of $\psi_1(\sigma)$ has the advantage of not
providing any answer. In fact, treating $\psi_1(\sigma)$ as the probability
density of $\sigma$ over the whole range of possible values of $\sigma$, from
zero to $+\infty$, we may attempt to compute the desired probabilities by taking
the integrals

$$P\{0 < \sigma < t\} = c \int_0^t \frac{d\sigma}{\sigma} = c \log \sigma \, \Big|_0^t,$$

$$P\{t < \sigma < +\infty\} = c \int_t^{\infty} \frac{d\sigma}{\sigma} = c \log \sigma \, \Big|_t^{\infty}.$$


It happens, however, that both of these integrals diverge and hence that
there is no real number representing either of them. Thus, it is impossible
to answer the question whether it is more probable that $\sigma < t$ or that
$\sigma > t$. Allegedly, this corresponds exactly to our state of mind regarding
$\sigma$, namely, to the complete lack of knowledge regarding its value. From
this point of view, one might regret perhaps that the integral of
$\psi_1(\sigma)$ taken between any positive limits, $0 < a < b$, converges so
that, for example, it appears possible to compare the probabilities
$P\{.1 < \sigma < .2\}$ and $P\{1 < \sigma < 2\}$ and to find them equal. This
circumstance does not seem to be consistent with the complete ignorance of the
value of $\sigma$ which was postulated.
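
The equality of the two probabilities follows from a one-line computation,
written out here as a worked step (treating the function c/σ formally as a
density, exactly as the passage does):

```latex
P\{a < \sigma < b\} = c \int_a^b \frac{d\sigma}{\sigma} = c \log \frac{b}{a},
\qquad\text{so that}\qquad
P\{.1 < \sigma < .2\} = c \log 2 = P\{1 < \sigma < 2\}.
```

The value depends only on the ratio b/a, so any two intervals whose end points
stand in the same ratio receive the same formal probability.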

A much pleasanter attempt to deal with the lack of probabilities a priori
consists in an effort to estimate them empirically. This method was used
by many authors but recently it was explicitly advocated by R. v. Mises.³
We may illustrate its use and also its shortcomings on the two examples
mentioned at the beginning of this conference. Thus the doctor who
specializes in treating patients with an excessive content of chemical A in
their blood may keep records of his determinations of chemical A. According to
some method, perhaps similar to the one used by Mrs. Tang in estimating
the distribution of the true sugar excess in varieties of sugar beet, the
doctor may establish a function, say $\psi(\gamma)$, which represents
approximately the true distribution of $\gamma$ in the population of persons ill
with the particular disease, of whom his office patients, in the course of the
last year or so, were a sample. He may then use this function for purposes of
estimating $\gamma$ during the following year.

Undoubtedly, this method of approach is far more realistic than the
invention of a priori distributions without any recourse to actual phenomena.
If it happens both that the population of ill persons does not change from
one year to another and that the growing reputation of the doctor does not
produce a change in the recruitment of his patients, then the function
representing the probability density of $\gamma$ in one year will be valid for
the next, and the doctor’s adjustment of the dose of the drug B will not be
more inaccurate than expected. However, it is common knowledge that
the conditions of health change from one year to another and from one
vicinity to the next and it is these changes that are the danger points of
the doctor’s proposed procedure. In fact, conditions of health may be
presumed just as variable in time as the conditions of breeding studied by
Mrs. Tang in 1937. Also, there is another danger, connected with the
unavoidable inaccuracies in estimating the a priori distribution of $\gamma$
using past experience. To illustrate this point, I call your attention to
Figure 1 and to Figures 2 and 3. It is not impossible that the unavoidable
random errors involved in using past experience may result in assuming that the
a priori probability density function of $\theta$ is represented by
$\psi_1(\theta)$ whereas, in actual fact, the true probability density function
of this parameter is $\psi_2(\theta)$. Figures 2 and 3 show, then, that the
a posteriori conclusions based on $\psi_1(\theta)$ will be very different from
the realities implied by $\psi_2(\theta)$.

3 Richard von Mises: “On the correct use of Bayes’ formula.” Annals of Math. Stat.,
Vol. 13 (1942), pp. 156-165.

While the suggested procedure of estimating the a priori distribution
using past experience involves some dangers, still it is applicable in all cases
where the same problem of estimation appears again and again and provides
the opportunity of collecting a reasonable amount of data. The first
example used in this conference illustrates a category of problems in which
the procedure is not applicable. In fact, while in recent decades the
governments of all civilized countries have made repeated attempts to study
various phases of their economy, including farming, the number of
observations made in the past is plainly insufficient to provide any sort of
approximation to the a priori distribution of a characteristic like the
hypothetical characteristic $\xi$ of the totality of farms. In addition, it is
well known that the economic processes are rather rapid and the totality of
farms in 1950 is a population entirely different from that in 1940 or in 1930.
These populations and their characteristics are external marks of the current
economic development with its periods of booms and recessions, and
this is just the reason why, short of a comprehensive probabilistic theory
of national economy, I personally am reluctant to consider $\xi$ as a random
variable. The postulation of a definite probability distribution of $\xi$ would
seem to be even less appropriate.

Studies of populations are usually made on relatively large samples, cer- 
tainly in hundreds and frequently in thousands or in tens of thousands. 
Cases of this kind have inspired two great mathematicians, S. Bernstein 
and, apparently somewhat later, R. v. Mises, to prove a very interesting 
theorem regarding the properties of the a posteriori distribution when the 
number of independent observations is indefinitely increased. 

Bernstein obtained his result some time before 1915. In fact, in 1915 he
described it in his lectures on probability which I had the good fortune to
attend. R. v. Mises published his result in 1919⁴ in Mathematische
Zeitschrift. It must have been proved a few years earlier, thus, at about the
same time as S. Bernstein’s result. Both results are to the general effect
that, when the a priori distribution of a given parameter and also the
distribution of the observable random variables satisfy certain conditions of
regularity, then the standardized a posteriori distribution of the estimated
parameter, given $n$ independent observations, tends, as $n \to \infty$, to the
normal distribution with zero expectation and unit variance. Thus, no matter
which particular function $\psi$ satisfying the conditions of regularity we
substitute into formula (2), if $n$ is sufficiently large, the results of
computing formula (2) will be approximately the same. Namely, whatever
$a < b$, the a posteriori probability that the difference between the true
value of $\theta$ and its a posteriori expectation will be between
$a\sigma_\theta$ and $b\sigma_\theta$ will be approximately equal to

$$\frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2} \, dx,$$

where $\sigma_\theta^2$ denotes the a posteriori variance of $\theta$.

4 Richard von Mises: “Fundamentalsätze der Wahrscheinlichkeitsrechnung.” Math.
Zeit., Vol. 4 (1919), pp. 1-97.

It should be mentioned that the original paper of R. v. Mises did not
enumerate the restrictions needed for the above conclusion. This gap was
later filled by J. Hosiasson.⁵

While the Bernstein-v. Mises theorem is a very interesting result, reveal- 
ing important properties of the a posteriori distributions, it has definite 
shortcomings if it is treated as the basis for extensive applications of the 
Bayes’ formula. First of all, many problems of estimation arise in which 
the number n of observable random variables is small and it is more or 
less hopeless to rely on an asymptotic result based on passage to the limit 
with $n \to \infty$. Second, the Bernstein-v. Mises theorem requires that certain
regularity conditions be satisfied, and it happens that these conditions are
far from being met universally. For example, they are not satisfied in
the example of $n$ independent variables following distribution (8). As
a result, the a posteriori distribution of $\theta$ is positive only on the
interval $(x, 1)$ and, for large values of $n$, is monotonically decreasing as
$\theta$ varies from $x$ to 1. Obviously, it cannot be made to approach a normal
limiting distribution by a mere process of standardization.

The third shortcoming is somewhat delicate. In order to explain it I call
your attention to the description of the theorem given above: “no matter
which particular function W satisfying the conditions of regularity we substitute
in formula (2), if $n$ is sufficiently large, then . . .” [This statement
is made on the premise that the appropriate conditions of regularity are
also met by $p_X(x_1, x_2, \cdots, x_n \mid \theta_1, \cdots, \theta_s)$.] Thus, the theorem asserts that
for every function W of the specified category there exist appropriately
large values of $n$ with which the true a posteriori distribution (standardized)
differs but little from the normal law.

However, the theorem does not assert that sufficiently large values of $n$
can be found such that, whatever a priori distribution of the specified broad
category we take, the difference between the true a posteriori distribution





5 Janina Hosiasson: “Quelques remarques sur la dépendance des probabilités a
posteriori de celles a priori.” Comptes-rendus, Premier Congrès des Math. des Pays Slaves,
Warszawa, 1929 (1930), pp. 375-382.
and the normal limit can be neglected. The Bernstein-v. Mises theorem
does not assert this and the assertion is not true. It follows that, in spite
of all its interest, this theorem cannot be considered as a universal justification
for the use of Bayes’ formula in all cases where the a priori distribution
of a parameter is uncertain.

In the history of methods of estimation, the two principles of best un- 
biased estimates and of maximum likelihood estimates play a role apart. 
Both were used by Gauss and by many authors thereafter. The principle 
of best unbiased estimates was never formulated unambiguously as a prin- 
ciple but simply came into frequent use, partly because it is easily applied 
in a broad category of cases and partly because it has important advan- 
tages as proved by Gauss and later popularized by Markoff. 

The principle of maximum likelihood was definitely proclaimed as a prin- 
ciple. This was done by R. A. Fisher in a number of his writings from 
which I shall give you a few quotations. However, probably feeling the 
weakness of a dogma, Fisher⁶ tried to support the dogma by rational arguments.
In this he was very successful and guessed a number of important
properties of the maximum likelihood estimates. Under suitable restrictions
these properties were subsequently proved, with increasing rigor and generality,
by Harold Hotelling,⁷ J. Doob⁸ and, finally, A. Wald.⁹

Thus, the maximum likelihood estimates will be considered from two 
different points of view. First, we shall consider them from the point of 
view in which the principle of maximum likelihood is understood to be a 
command to use these particular estimates for the sole reason that they 
maximize the likelihood function. The second time we shall consider the 
use of the same estimates prompted not by the principle but by an under- 
standing of the properties which they possess in certain specified cases. 
In this, the maximum likelihood estimates will appear to play a role some- 
what similar to that of best unbiased estimates, depending upon the prior 
solution of the problem of estimation by interval, i.e. upon the solution 
which is independent of any assumption regarding the probabilities a priori. 

Here is a word of warning. The justification for the use of best unbiased 
and maximum likelihood estimates just mentioned is merely a justification. 
It is not intended to suggest that this is the only justification possible. We 


6 R. A. Fisher: “On the mathematical foundations of theoretical statistics.” Phil. 
Trans. Roy. Soc., London, Ser. A, Vol. 222 (1922), pp. 309-368. 

7 Harold Hotelling: “The consistency and ultimate distribution of optimum statistics.” 
Trans. Am. Math. Soc., Vol. 32 (1930), pp. 847-859. 

8 Joseph Doob: “Probability and statistics.” Trans. Am. Math. Soc., Vol. 36 (1934), 
pp. 759-775. 

9 A. Wald: “Note on the consistency of the maximum likelihood estimate.” Annals of 
Math. Stat., Vol. 20 (1949), pp. 595-601. 
shall discuss this in the next conference after presenting the non-Bayes’ 
solution of the problem of estimation by interval. 

Fisher’s dogmatic attitude towards maximum likelihood estimates may 
be illustrated by the following quotations, in which the more relevant pas- 
sages are italicized. 


The rejection of the theory of inverse probability [of the use of Bayes’ formula with
an invented a priori distribution: J. N.] was for a time wrongly taken to imply that
we cannot draw, from knowledge of a sample, inferences respecting the corresponding
population. Such a view would entirely deny validity to all experimental science.
What has now appeared is that the mathematical concept of probability is, in most
cases, inadequate to express our mental confidence or diffidence in making such
inferences, and that the mathematical quantity which appears to be appropriate for
measuring our order of preference among different possible populations does not in fact obey
the laws of probability. To distinguish it from probability, I have used the term
“Likelihood” to designate this quantity; .... (R. A. Fisher: Statistical Methods for
Research Workers. 11th ed. Oliver and Boyd, London, 1950, p. 10.)

The fact that the concept of probability is adequate for the specification of the nature 
and extent of uncertainty in these deductive arguments is no guarantee of its adequacy 
for reasoning of a genuinely inductive kind. ... More generally, however, a mathe- 
matical quantity of a different kind, which I have termed mathematical likelihood, 
appears to take its place as a measure of rational belief when we are reasoning from 
the sample to the population. (R. A. Fisher: “The logic of inductive inference.” Jr. 
Roy. Stat. Soc., Vol. 98 (1935), p. 40.) 


These quotations illustrate a difference between Fisher’s attitude towards 
probability and my own. For Fisher, probability appears as a measure 
of uncertainty applicable in certain cases but, regretfully, not in all cases. 
For me, it is solely the answer to the question, “how frequently this or that 
happens.” 

Now, here are a few quotations from Fisher illustrating his non-dogmatic 
attitude toward the principle of likelihood. 


Obviously the claim that the likelihood possesses these properties, and provides a 
rational basis for exact inference, can only be made in the light of a theory of estimation 
applicable to finite samples. In (2)¹⁰ I have developed such a theory, and have 
demonstrated that the most likely value of z, that is, the particular estimate found by 
the method of maximum likelihood, possesses uniquely those sampling properties which 
are required of a satisfactory estimate. (R. A. Fisher: “Inverse probability and the use 
of Likelihood.” Proc. Cambridge Phil. Soc., Vol. 28 (1932), pp. 257-261.) 


Here, then, there is no contention that the likelihood function is in itself 
a measure of confidence in a given value of a parameter. On the other 
hand, it is claimed that it is advantageous to use the maximum likelihood 
estimates because they have some desirable properties. Some of the desir- 
able and undesirable properties of an estimate are described as follows. 


10 R. A. Fisher: “On the mathematical foundations of theoretical statistics.” Phil. 
Trans. Roy. Soc., London, Ser. A, Vol. 222 (1922), pp. 309-368. 


188 MATHEMATICAL STATISTICS AND PROBABILITY 


If we calculate a statistic, such, for example, as the mean, from a very large sample, 
we are accustomed to ascribe to it great accuracy; and indeed it will usually, but not 
always, be true, that if a number of such statistics can be obtained and compared, the 
discrepancies among them will grow less and less, as the samples from which they are 
drawn are made larger and larger. In fact, as the samples are made larger without 
limit, the statistic will usually tend to some fixed value characteristic of the population, 
and, therefore, expressible in terms of the parameters of the population. If, therefore, 
such a statistic is to be used to estimate these parameters, there is only one parametric 
function to which it can properly be equated. If it be equated to some other parametric 
function, we shall be using a statistic which even from an infinite sample does not give 
the correct value; it tends indeed to a fixed value, but to a value which is erroneous 
from the point of view with which it was used. Such statistics are termed Inconsistent 
Statistics; except when the error is extremely minute, as in the use of Sheppard’s adjust- 
ments, inconsistent statistics should be regarded as outside the pale of decent usage. 
(R. A. Fisher: Statistical Methods for Research Workers. 11th ed. Oliver and Boyd, 
London, 1950, p. 11.) 


With this preference of Fisher not to use inconsistent statistics, I perfectly
agree. When one intends to estimate a parameter $\theta$, it is definitely
not profitable to use an inconsistent estimate.


Consistent statistics, on the other hand, all tend more and more nearly to give the 
correct values, as the sample is more and more increased; at any rate, if they tend to 
any fixed value it is not to an incorrect one. In the simplest cases, with which we 
shall be concerned, they not only tend to give the correct value, but the errors, for 
samples of a given size, tend to be distributed in a well-known distribution . . . known 
as the Normal Law of Frequency of Error, or more simply as the normal distribution. 
The liability to error may, in such cases, be expressed by calculating the mean value of 
the squares of these errors, a value which is known as the variance; and in the class of 
cases with which we are concerned, the variance falls off with increasing samples, in 
inverse proportion to the number in the sample. 

Now, for the purpose of estimating any parameter, such as the centre of a normal 
distribution, it is usually possible to invent any number of statistics such as the arith- 
metic mean, or the median, etc., which shall be consistent in the sense defined above, 
and each of which has in large samples a variance falling off inversely with the size of 
the sample. But for large samples of a fixed size the variance of these different sta- 
tistics will generally be different. Consequently, a special importance belongs to a 
smaller group of statistics, the error distributions of which tend to the normal distribu- 
tion, as the sample is increased, with the least possible variance. We may thus separate 
off from the general body of consistent statistics a group of especial value, and these 
are known as efficient statistics. 


The researches of the author have led him to the conclusion that an efficient statistic 
can in all cases be found by the Method of Maximum Likelihood; that is, by choosing 
statistics so that the estimated population should be that for which the likelihood is 
greatest. (R. A. Fisher: Statistical Methods for Research Workers. 11th ed. Oliver 
and Boyd, London, 1950, pp. 11-14.)


Here, again, I agree unreservedly with Fisher that, when several consistent
estimates of the same parameter are available, all tending to be
normally distributed, the one with the smallest variance is preferable to
others. Consequently, whenever the method of maximum likelihood yields
estimates which are both consistent and efficient, this circumstance (but not
the principle) may be considered an inducement to use the maximum likelihood
estimates. On the other hand, if and when the maximum likelihood
estimates are either inefficient or are outside “the pale of decent usage”
by being inconsistent, the suggestion to use them, merely because they
maximize the “measure of rational belief when we are reasoning from the
sample to the population,” does not seem convincing.
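
The comparison of consistent estimates by their variances, as described in the quotations above, can be illustrated by simulation. The following Python sketch is an illustration only: it assumes samples from a normal distribution with centre 0 and unit variance, and compares the sampling variances of the arithmetic mean and of the median, the two statistics Fisher mentions. The function name, the sample size and the number of replications are hypothetical choices.

```python
import random
import statistics


def sampling_variances(n, replications, seed=0):
    """Empirical variances of the mean and of the median of normal samples.

    Both statistics are computed on the same simulated samples of size n.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


if __name__ == "__main__":
    var_mean, var_median = sampling_variances(n=101, replications=2000, seed=1)
    print("variance of mean:  ", var_mean)
    print("variance of median:", var_median)
    print("ratio:             ", var_median / var_mean)
```

Both statistics are consistent for the centre of the distribution, but in such a run the median should show the larger variance, which is the sense in which the mean is the more efficient of the two for normal samples.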

However, are there cases where the maximum likelihood estimates are
either inconsistent or inefficient? Yes, there are. The conditions where the
maximum likelihood estimates are both consistent and efficient are stated
in the papers by Hotelling, Doob and Wald quoted above. If these conditions
are not satisfied, then (i) the maximum likelihood estimates need not
be consistent and (ii) even if they are consistent, they need not be efficient.
The following two examples demonstrating these possibilities are taken from
the joint publication of Dr. E. L. Scott and myself.¹¹

(i) Consider an increasing sequence of $s$ series of measurements $x_{ij}$ ($i = 1,
2, \cdots, s$; $j = 1, 2, \cdots, n_i$). Assume that all the measurements are mutually
independent and follow a normal law with the same variance $\sigma^2$. However,
the quantity $\xi_i$ measured in the $i$th series of measurements is different from
the quantity $\xi_j$ measured in the $j$th series. This is exactly the case where
a fixed set of instruments is routinely used to measure different objects,
perhaps characteristics of different stars. The joint probability density of
all the observations is given by the formula

$$p_X = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{N} e^{-\sum_{i=1}^{s}\sum_{j=1}^{n_i}(x_{ij}-\xi_i)^2/2\sigma^2}, \qquad N = \sum_{i=1}^{s} n_i.$$

In these circumstances it is frequently important to estimate $\sigma$, the standard
error of measurements appropriate to the instruments used.

In Fisher’s terminology, the likelihood function of a set of parameters
means simply the probability density function (or a multiple of it) in which
the particular values of the observable random variables are fixed and the
parameters play the role of arguments. Thus, in the particular case considered,
the likelihood function of the $s + 1$ parameters involved, namely,
$\xi_1, \xi_2, \cdots, \xi_s$ and $\sigma$, is, say,

$$L = \text{const.} \times \sigma^{-N} e^{-\sum_{i=1}^{s}\sum_{j=1}^{n_i}(x_{ij}-\xi_i)^2/2\sigma^2}.$$

Given any system of observed values $x_{ij}$ (for $i = 1, 2, \cdots, s$; $j = 1, 2, \cdots,
n_i$) of the random variables $X_{ij}$, the maximum likelihood estimates of the


11 J. Neyman and Elizabeth L. Scott: “Consistent estimates based on partially
consistent observations.” Econometrica, Vol. 16 (1948), pp. 1-32.
parameters are those values $\hat\xi_i$, $\hat\sigma$ for which $L$ is a maximum. You will
easily verify that the maximum likelihood estimates are

$$\hat\xi_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} = \bar{x}_{i\cdot}, \quad \text{for } i = 1, 2, \cdots, s,$$

and

$$\hat\sigma = \left[\frac{1}{N}\sum_{i=1}^{s}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i\cdot})^2\right]^{1/2}.$$

We shall be particularly interested in the special case where $n_i = 2$ for
$i = 1, 2, \cdots, s$. Then the square of the maximum likelihood estimate $\hat\sigma^2$
appears as a simple arithmetic mean,

$$\hat\sigma^2 = \frac{1}{s}\sum_{i=1}^{s} \tfrac{1}{4}(x_{i1} - x_{i2})^2,$$

of quantities $\tfrac{1}{4}(x_{i1} - x_{i2})^2$. The expectation of each such quantity is

$$E\left[\tfrac{1}{4}(x_{i1} - x_{i2})^2\right] = \tfrac{1}{2}\sigma^2$$

and the variance, say,

$$V = \frac{\sigma^4}{2}.$$

It follows that the variance of $\hat\sigma^2$ is

$$V_{\hat\sigma^2} = \frac{\sigma^4}{2s}$$

and tends to zero as $s$ is increased. Thus, as $s$ is indefinitely increased, $\hat\sigma^2$
tends in probability to its expectation $\sigma^2/2$ and, consequently, $\hat\sigma$ tends in
probability not to $\sigma$ but to the quantity $\sigma/\sqrt{2}$. It follows that, in this
particular case, the maximum likelihood estimate of $\sigma$ is inconsistent.
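
The inconsistency can be observed directly by simulation. The Python sketch below is an illustration under stated assumptions and not a reproduction of any computation in the paper quoted: the unknown quantities measured in the several series are drawn arbitrarily from a uniform distribution, each series consists of two measurements, and the function name is hypothetical.

```python
import math
import random


def mle_sigma_from_pairs(sigma, s, rng):
    """Maximum likelihood estimate of sigma from s pairs of measurements.

    Each pair has its own unknown mean (drawn arbitrarily here); both
    members of a pair are normal with the common standard error sigma.
    """
    total = 0.0
    for _ in range(s):
        xi = rng.uniform(-100.0, 100.0)  # quantity measured in this series
        x1 = rng.gauss(xi, sigma)
        x2 = rng.gauss(xi, sigma)
        # With two measurements the sum of squares about the pair mean is
        # (x1 - x2)^2 / 2 and N = 2s, so each pair contributes
        # (x1 - x2)^2 / 4 to s times the squared estimate.
        total += (x1 - x2) ** 2 / 4.0
    return math.sqrt(total / s)


if __name__ == "__main__":
    rng = random.Random(1948)
    sigma = 2.0
    for s in (10, 1000, 100000):
        print(s, mle_sigma_from_pairs(sigma, s, rng), sigma / math.sqrt(2.0))
```

As the number of series grows, the printed estimate should settle near $\sigma/\sqrt{2}$ rather than near $\sigma$, however many pairs are added.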

It may be said that the situation is trivial and that the bias in the estimate
can be easily corrected by multiplying the estimate by $\sqrt{2}$. This is undoubtedly
true but it is beside the point. It will be observed that the product
$\hat\sigma\sqrt{2}$ is not the maximum likelihood estimate of $\sigma$ and that the bias in $\hat\sigma$ does
not tend to zero as the number $s$ is increased. This is just the circumstance
which the example is meant to illustrate.

(ii) In order to illustrate the possibility of the maximum likelihood estimate
being consistent without being efficient, we shall use an example, similar to
the above, where an increasing sequence of $s$ series of measurements of the
same quantity are made but where the error variance may vary from one
series to the next. This is, for example, the case where $\xi$ stands for the velocity
of light measured by $s$ different observers, each using different equipment.
As previously, we shall assume that all the measurements $x_{ij}$, for $i = 1, 2,
\cdots, s$; $j = 1, 2, \cdots, n_i$, are independent and normally distributed about $\xi$
so that their joint probability density function is

$$p = \prod_{i=1}^{s} \left(\frac{1}{\sigma_i\sqrt{2\pi}}\right)^{n_i} e^{-\sum_{j=1}^{n_i}(x_{ij} - \xi)^2/2\sigma_i^2}.$$

As you will have no difficulty in verifying, the maximum likelihood estimate
of $\xi$ is the root of the equation, say

$$\sum_{i=1}^{s} \frac{n_i(\bar{x}_{i\cdot} - \xi)}{S_i^2 + (\bar{x}_{i\cdot} - \xi)^2} = 0, \tag{18}$$

where $S_i^2$ has the usual meaning,

$$n_i S_i^2 = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})^2.$$

The equation determining the maximum likelihood estimate $\hat\xi$ is complicated
but can be solved numerically.
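
One way of carrying out such a numerical solution is sketched below in Python. The sketch is an illustration and rests on an assumption: the likelihood equation is taken in the form obtained by replacing each $\sigma_i^2$ in the density above by its maximizing value $S_i^2 + (\bar{x}_{i\cdot} - \xi)^2$, which gives $\sum_i n_i(\bar{x}_{i\cdot} - \xi)/[S_i^2 + (\bar{x}_{i\cdot} - \xi)^2] = 0$. The function names are hypothetical, and the method (bisection between the smallest and the largest series mean) is only one of several possible.

```python
from statistics import fmean


def series_summary(series):
    """Return (n_i, mean, S_i^2) with n_i * S_i^2 = sum of squared deviations."""
    n = len(series)
    xbar = fmean(series)
    s2 = sum((x - xbar) ** 2 for x in series) / n
    return n, xbar, s2


def score(xi, summaries):
    """Left-hand side of the likelihood equation at the trial value xi."""
    return sum(
        n * (xbar - xi) / (s2 + (xbar - xi) ** 2) for n, xbar, s2 in summaries
    )


def solve_likelihood_equation(data, tol=1e-10):
    """Find a root of the likelihood equation by bisection.

    Assumes every series has S_i^2 > 0. At the smallest series mean the
    left-hand side is non-negative and at the largest it is non-positive,
    so a root lies between them. The equation may have several roots; this
    routine returns one of them, not necessarily the absolute maximum.
    """
    summaries = [series_summary(s) for s in data]
    lo = min(xbar for _, xbar, _ in summaries)
    hi = max(xbar for _, xbar, _ in summaries)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, summaries) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical measurements of the same quantity by three observers.
    data = [[9.8, 10.1, 10.3], [10.0, 10.4, 9.9, 10.2], [9.5, 10.6]]
    print(solve_likelihood_equation(data))
```

Replacing the factor `n` in `score` by an arbitrary weight would give a solver for a weighted equation of the same shape.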

In addition to equation (18), the paper just quoted studies a more general
equation, say,

$$\sum_{i=1}^{s} \frac{w_i(\bar{x}_{i\cdot} - \xi)}{S_i^2 + (\bar{x}_{i\cdot} - \xi)^2} = 0, \tag{19}$$

obtained from (18) by substituting an arbitrary weight $w_i$ for $n_i$. It is shown
that, under mild restrictions regarding $\sigma_i$ and $w_i$, the solution of (19) is a
consistent estimate of $\xi$ and that its asymptotic variance is, say,

$$V = \frac{\sum_{i=1}^{s} w_i^2/[(n_i - 2)\sigma_i^2]}{\left(\sum_{i=1}^{s} w_i/\sigma_i^2\right)^2}.$$

It follows that the system of weights $w_i$ which minimize the variance of the
estimate are those for which

$$\frac{w_i}{n_i - 2} - U = 0, \quad \text{for } i = 1, 2, \cdots, s,$$

where $U = \sum_k [w_k^2/((n_k - 2)\sigma_k^2)] \big/ \sum_k [w_k/\sigma_k^2]$ does not depend on $i$, or

$$w_i = (n_i - 2) \times \text{const.}$$

With these weights, equation (19) takes the form,

$$\sum_{i=1}^{s} \frac{(n_i - 2)(\bar{x}_{i\cdot} - \xi)}{S_i^2 + (\bar{x}_{i\cdot} - \xi)^2} = 0,$$

with the corresponding asymptotic variance of the solution equal to, say

$$V_{\min} = \frac{1}{\sum_{i=1}^{s} (n_i - 2)/\sigma_i^2}.$$

On the other hand, the asymptotic variance of the maximum likelihood solution
is

$$V_{\hat\xi} = \frac{\sum_{i=1}^{s} n_i^2/[(n_i - 2)\sigma_i^2]}{\left(\sum_{i=1}^{s} n_i/\sigma_i^2\right)^2}$$

and is, generally, greater than $V_{\min}$. The two variances $V_{\min}$ and $V_{\hat\xi}$ coincide
only if, for all $i = 1, 2, \cdots, s$, the number $n_i$ of measurements forming the
$i$th series is the same, say $n$. Then $U = n/(n - 2)$ and $V_{\hat\xi} = V_{\min}$. Furthermore,
as $s$ is indefinitely increased, the quotient $V_{\min}/V_{\hat\xi}$ need not tend to
unity so that the asymptotic efficiency of $\hat\xi$ is less than that of the solution of (19)
with the optimum weights. To see this
you may wish to study more closely the simple particular case where $\sigma_1 =
\sigma_2 = \cdots = \sigma_s = \sigma$ (though when estimating $\xi$ we are not aware of this fact),
and where $n_{2i-1} = n'$ and $n_{2i} = n''$ for all $i = 1, 2, \cdots, s$. It is also convenient
to assume that $s$ is an even number, say $s = 2m$. You will see that,
in this particular case the quotient $V_{\min}/V_{\hat\xi}$ has a value independent of $m$
and less than unity. Thus, it is shown that, even if the maximum likelihood
estimate is consistent, it need not be efficient and that, on occasion, consistent
and asymptotically normal estimates are easily constructed with variances
smaller than that of the maximum likelihood estimate. In the next conference
I shall attempt to show that smallness of the variance combined with
consistency and asymptotic normality of an estimate means a substantial
advantage in terms of consequences of the systematic use of a given estimation
procedure. For me personally, this constitutes a decisive argument against
the principle of maximum likelihood treated as a principle in the strict sense
of the word. The following quotation from Fisher seems to suggest that this
would also be his opinion.

In the present paper I have been particularly concerned to show that all the properties
of mathematical likelihood, which make it valuable, can be demonstrated independently
of any postulated value. From this it seems to me to follow that the concept
of likelihood could be eliminated completely from discussions of estimation, and these
discussions be adequately, though perhaps more cumbrously, carried out in other terms.
(R. A. Fisher: “The logic of inductive inference.” Jr. Roy. Stat. Soc., Vol. 98 (1935),
p. 81.)


However, on the next page of the same paper we read: 


The fact that likelihood has been an aid to thought in such progress as has so far
been made in the subject will suggest the advisability of using it for what it is worth,
even though, ultimately, we may find ourselves able to do better. That there are logical
situations in which the uncertainty of our inferences is expressible in terms of likelihood,
but not in terms of probability, is one solid step gained, even though more comprehensive
notions may later be developed. (R. A. Fisher: “The logic of inductive inference.”
Jr. Roy. Stat. Soc., Vol. 98 (1935), p. 82.)


This passage requires comments from two different points of view. First, 
I wish to point out that Fisher’s presumption that the mathematical con- 
cept of probability is inadequate when we are faced with the problem of 
estimation is based solely on his (quite correct) realization that, when a 
priori probabilities are not available (which he presumed to be always the 
case and which I agree is almost always the case), then the formula of 
Bayes is not applicable. Now, the general inadequacy of a concept is some- 
thing which requires proof and the fact that one particular use of a concept 
is inapplicable does not, by itself, prove that no other uses of the same 
concept are possible with which to create an adequate basis for the theory 
of estimation. In fact, as you will see in the next conference, a theory of 
statistical estimation was developed entirely within the classical theory of 
probability, a theory which uses no other concepts and is applicable without 
any reference to probabilities a priori. For some questions which will be 
discussed later, it is relevant that Fisher, when writing his paper of 1935, 
was still of the opinion that the theory of probability by itself is not ade- 
quate for treating problems of estimation. 

My other comment concerns the utility of the concept of likelihood as a 
measure of our confidence in the particular values of unknown parameters. 
If one attempts to answer the question “Why does Fisher think that in 
certain logical situations the likelihood is adequate to express our uncer- 
tainty?” one is forced to refer to passages like the last but one quoted. 
Here one finds contentions to the effect that the maximum likelihood esti- 
mates have properties which make the likelihood valuable and “which can 
be demonstrated independently of any postulated value.” This demonstra- 
tion appears to be on the ground of probability theory. Thus, the general 
argument is that the likelihood is an adequate measure of our confidence 
because the estimates obtained on this ground (or so Fisher thought) possess
certain desirable probabilistic properties. In these circumstances, the contention
that the likelihood is adequate in cases where the concept of probability
is not appears baseless and the references to likelihood as a measure
of confidence contribute nothing but a certain amount of confusion. If
Fisher’s presumption that the desirable probabilistic properties (consistency
and efficiency) are universally possessed by the maximum likelihood estimates
were correct, then, except for this confusion of thought, the notion
of the likelihood as a measure of confidence would not be harmful. However,
as things stand, the notion of the new measure of confidence is regrettable
because it may mislead the credulous part of the consumers of statistical
theory.

All this applies to the notion of likelihood as a measure of confidence. 
On the other hand, there is no reason to object to the use of the label 
“likelihood function” applied to the probability density of the observable 
random variables with fixed particular values of these variables, considered 
as a function of the parameters. 

To sum up: whenever the conditions of a particular problem imply that 
an unknown parameter is a random variable with a specified distribution 
a priori, then the formula of Bayes provides a clear cut solution of the 
problem of estimation; this solution, either in the form of a single estimate 
or in the form of an estimating interval, classical or modernized, has a 
simple interpretation in terms of frequencies of successes in estimating the 
unknown parameter; when the conditions of the practical problem consid- 
ered do not imply the a priori distribution of the estimated parameter, then 
it is still “legitimate” to use the formula of Bayes; however, notwithstanding 
the theorem of Bernstein-v. Mises and the attempts to estimate the a priori 
distribution from past experience, such applications of Bayes’ formula have 
a doubtful frequency interpretation; finally, it appears unprofitable to 
adopt the principle of maximum likelihood (and this also applies to the 
principle of insufficient reason and to its more recent modifications) because 
cases exist in which a strict adoption of this principle would lead to excess- 
ively frequent large errors in estimation that are perfectly avoidable. 


Part 2. Outline of the Theory of Confidence Intervals 


(Based on a conference held in the auditorium of the United States Department of 
Agriculture, April 9, 1937, 10 a.m., Dr. Frederick V. Waugh presiding.) 


This morning I shall resume the outline of the problem of statistical esti- 
mation at the point where I stopped yesterday. You will remember that 
our discussion ended with the general conclusion that the classical approach 
by means of the theorem of Bayes provides a satisfactory solution only in 
the exceptional cases where the a priori distribution of the estimated parameter
is known. In all other cases we have at best approximations of unknown
precision, and at worst gross misconceptions dressed in impressive
phraseology.

My purpose this morning is to explain a new method of approach to the 
problem of estimation especially designed for all cases in which the a priori 
distribution of the estimated parameter is not known and where, therefore, 
the estimated parameter may be treated as an unknown constant, not as a 
random variable. The theory I am going to present is known as the theory 
of confidence intervals. The first outline of this theory appeared in my
paper¹ of 1934. A more thorough treatment is found in two subsequent
memoirs, one in English² and the other in French.³ However, the first
reference to confidence intervals appeared in 1932 in a monograph⁴ of
Wacław Pytkowski, then a student of mine, who applied the new theory
to the problems of estimation of various characteristics of small farms in
Poland.

When approaching the practical problem of estimation in cases where
no information about the a priori distribution is available, it is important
to realize that the corresponding mathematical problem should be stated in
a form which is essentially different from the form leading to Bayes’ solution.
The Bayes’ solution answers the following question: (a) given that
the observable random variables $X_1, X_2, \cdots, X_n$ have assumed the specified
values $x_1, x_2, \cdots, x_n$, what is the probability,

$$P\{a \le \theta_1 \le b \mid (X_1 = x_1)(X_2 = x_2) \cdots (X_n = x_n)\},$$

that the estimated parameter $\theta_1$ will have a value contained between the
specified limits $a < b$? As we have seen, the answer to this question
depends upon the a priori distribution of $\theta_1$ and, if this distribution is not
known, question (a) cannot be answered. Thus, if a solution of the practical
problem of estimation is to be based on the theory of probability, it
will be necessary to formulate a new problem, say (b), different from (a).
Problem (b) must be such that its solution will not depend upon the a priori
distribution of the estimated parameter and, at the same time, will give an


1 J. Neyman: “On the two different aspects of the representative method: the method
of stratified sampling and the method of purposive selection.” Jr. Roy. Stat. Soc., Vol.
97 (1934), pp. 558-625.

2 J. Neyman: “Outline of a theory of statistical estimation based on the classical
theory of probability.” Phil. Trans. Roy. Soc., London, Ser. A, Vol. 236 (1937), pp.
333-380.

3 J. Neyman: “L’estimation statistique traitée comme un problème classique de
probabilité.” Actualités Scientifiques et Industrielles, No. 739 (1938), pp. 25-57.

4 Wacław Pytkowski: “The dependence of the income in small farms upon their area,
the outlay and the capital invested in cows.” Bibljoteka Puławska, No. 34 (1932),
Warszawa, 59 pp. + 4 tables.
intelligible answer to the difficulty facing the practical statistician. It 
appears that both the formulation of the new probabilistic problem (b) and 
its solution are very simple and the fact that they were not found for a 
long time must be ascribed to what Karl Pearson called “routine of thought” 
and to attachment to the formula of Bayes. The scholars must have been 
so impressed by Bayes’ formula that they just did not think of thinking 
about the problem in a different manner. However, as we shall see later, 
the elements of the new idea can be discovered in the writings of many 
earlier authors, beginning with Gauss. Unfortunately, these elements of 
thought, for some reason, took quite a long time to grow and to crystallize. 

We shall begin by recalling what exactly the practical statistician does 
when he is faced with a problem of estimation and what exactly he needs 
from the theory. In this, our attention will be primarily directed towards 
the problem of estimation by interval and we shall have to return to the 
ideas described in the early part of yesterday’s conference. 

We contemplate a situation in which the practical statistician is interested
in the value of the parameter $\theta_1$ that appears in the probability density
function $p_X(x_1, x_2, \cdots, x_n \mid \theta_1, \theta_2, \cdots, \theta_s)$ of $n$ observable random variables
$X_1, X_2, \cdots, X_n$. The analytical form of this probability density function
is known to the statistician, but the values of the parameters $\theta_1, \theta_2, \cdots, \theta_s$
are unknown, except that they are contained in some specified intervals,
say $A_i < \theta_i < B_i$ ($i = 1, 2, \cdots, s$), finite or infinite. The practical statistician
is faced with the necessity of taking an action which should be
adjusted to the value of the parameter $\theta_1$. If he is the M.D. of the second
example in yesterday’s conference, the action contemplated consists in
administering to the patient a dose of drug B, a dose which should be appropriately
adjusted to the content $\eta$ of substance A in the patient’s blood.
If the practical statistician is concerned with the policy of the Department
of Agriculture, his contemplated action may consist in suggesting a provision
in a forthcoming bill, a provision which should be adjusted to the
characteristic $\xi$ of the totality of farms in the United States as mentioned
in the first example of the last conference. Unfortunately, neither $\xi$ nor $\eta$
can be evaluated exactly and the best the two practical statisticians can
do is to observe particular values of the random variables $X_1, X_2, \cdots, X_n$
and base their actions on these observations. In each case the assertion
about the true value of the unknown parameter $\theta_1$ ($\xi$ in one case and $\eta$ in
the other) will be made in the same form,

$$\underline{\theta}(X_1, X_2, \cdots, X_n) \le \theta_1 \le \bar{\theta}(X_1, X_2, \cdots, X_n), \tag{1}$$

where $\underline{\theta}$ and $\bar{\theta}$ are functions of the observable random variables. Then the
practical statistician will adjust his actions as if it were known for certain
that the true value of $\theta_1$ is contained between the limits indicated.

All human actions are subject to error and the actions of the practical
statistician cannot be an exception to the general rule. Thus the practical
statistician must be aware that, whatever functions $\underline{\theta}$ and $\bar{\theta}$ he selects, his
assertions about the value of $\theta_1$ will be erroneous from time to time. The
best he can hope to arrange is that the errors of estimation do not occur too
frequently. Also, he may have in mind a scale of importance of different
errors. For example, in a particular case an overestimate of the parameter
may be more important to avoid than an underestimate. Finally, the practical
statistician is likely to desire that the difference $\bar{\theta} - \underline{\theta}$ be, in general, as
small as possible. However, the most pressing need which the practical
statistician is likely to feel is that he be given the opportunity to select a
number $\alpha$, $0 < \alpha < 1$, just as close to unity as he desires, and to determine a
pair of functions $\underline{\theta}(X_1, X_2, \cdots, X_n)$ and $\bar{\theta}(X_1, X_2, \cdots, X_n)$ such that their
use to estimate $\theta_1$ in the manner described will yield correct results with the
long-run relative frequency equal to $\alpha$ or, if this is impossible, at least equal
to $\alpha$.

As a general result of this discussion, we can now formulate the 
mathematical problem (b) referring to the problem of estimating a parameter $\theta_1$, 
which in the modern form of theory of estimation takes the place of problem 
(a) discussed above. 

Problem (b). Given that the observable random variables $E = (X_1, X_2, \ldots, X_n)$ 
follow a distribution with the probability density function 
$p_E(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_s)$ depending on $s$ parameters 
$\theta_1, \theta_2, \ldots, \theta_s$, the values of which 
are unknown; given also that the parameter $\theta_i$ may have any value between the 
specified limits $A_i < \theta_i < B_i$ for $i = 1, 2, \ldots, s$; finally, given a number $\alpha$ 
between the limits $0 < \alpha < 1$, to determine two functions $\underline{\theta}(x_1, x_2, \ldots, x_n)$ and 
$\bar{\theta}(x_1, x_2, \ldots, x_n)$ defined over all possible systems of values $x_1, x_2, \ldots, x_n$ of the 
observable random variables, such that for all possible systems of values of the 
parameters $\theta_1, \theta_2, \ldots, \theta_s$ 

$$P\{\underline{\theta}(X_1, X_2, \ldots, X_n) \le \theta_1 \le \bar{\theta}(X_1, X_2, \ldots, X_n) \mid \theta_1, \theta_2, \ldots, \theta_s\} \equiv \alpha. \tag{2}$$


It is essential to be entirely clear about the implications of the requirements 
imposed on the two functions $\underline{\theta}$ and $\bar{\theta}$. You will notice the sign of identity 
$\equiv$ appearing in formula (2). This sign emphasizes the requirement that the 
probability on the left be equal to $\alpha$ irrespective of what value $\theta_1$ takes 
between $A_1 < \theta_1 < B_1$, and irrespective of the values of the other parameters 
$\theta_2, \theta_3, \ldots, \theta_s$. Thus, in particular, if unity, two and three are the possible 
values of $\theta_1$, it is required from the functions $\underline{\theta}$ and $\bar{\theta}$ that 

$$P\{\underline{\theta}(X_1, X_2, \ldots, X_n) \le 1 \le \bar{\theta}(X_1, X_2, \ldots, X_n) \mid (\theta_1 = 1), \theta_2, \ldots, \theta_s\} \equiv \alpha,$$
$$P\{\underline{\theta}(X_1, X_2, \ldots, X_n) \le 2 \le \bar{\theta}(X_1, X_2, \ldots, X_n) \mid (\theta_1 = 2), \theta_2, \ldots, \theta_s\} \equiv \alpha,$$
$$P\{\underline{\theta}(X_1, X_2, \ldots, X_n) \le 3 \le \bar{\theta}(X_1, X_2, \ldots, X_n) \mid (\theta_1 = 3), \theta_2, \ldots, \theta_s\} \equiv \alpha,$$

etc., where the identity signs refer to the possibility of variation in the values 
of $\theta_2, \theta_3, \ldots, \theta_s$ and require that the probability on the left hand side keeps 
the same value $\alpha$, irrespective of changes in the values of $\theta_2, \theta_3, \ldots, \theta_s$. In 
other words, it is required of the functions $\underline{\theta}$ and $\bar{\theta}$ that, if $\theta_1 = 1$, they bracket 
unity with the prescribed frequency $\alpha$. If $\theta_1 = 2$, they are required to bracket 
2 with the same frequency $\alpha$, etc. 

It is seen that our requirements regarding the functions $\underline{\theta}$ and $\bar{\theta}$ are tricky 
and, at least at first sight, one does not know quite where to begin to satisfy 
them. However, it was possible to prove that the problem of determining 
$\underline{\theta}$ and $\bar{\theta}$ reduces to another problem of a more familiar nature. Now let us 
adopt the following definition. We denote by $\alpha$ a fixed number between zero 
and unity. 

If two functions $\underline{\theta}(X_1, X_2, \ldots, X_n)$ and $\bar{\theta}(X_1, X_2, \ldots, X_n)$ of the observable 
random variables $X_1, X_2, \ldots, X_n$, defined over all possible systems of values of 
these variables, possess the property that, whatever the possible value of the 
parameter $\theta_1$, the probability (2) of $\underline{\theta}(X_1, X_2, \ldots, X_n)$ falling short of $\theta_1$ at the same 
time that $\bar{\theta}(X_1, X_2, \ldots, X_n)$ is at least equal to $\theta_1$ equals $\alpha$ identically in 
$\theta_2, \theta_3, \ldots, \theta_s$, then we shall say that $\underline{\theta}$ and $\bar{\theta}$ are the lower and the upper confidence 
limits for $\theta_1$ corresponding to the confidence coefficient $\alpha$. 

Furthermore, the interval $[\underline{\theta}(X_1, X_2, \ldots, X_n), \bar{\theta}(X_1, X_2, \ldots, X_n)]$ will be 
called the confidence interval for $\theta_1$ corresponding to the confidence coefficient $\alpha$. 

In an earlier part of this book, we have used the terms sample point and 
sample space. If $x_1, x_2, \ldots, x_n$ are possible values of the observable random 
variables $X_1, X_2, \ldots, X_n$, then we say that the system of $n$ numbers 
$(x_1, x_2, \ldots, x_n)$ determines (or represents) a possible sample point. The set of all 
possible sample points is called the sample space and is denoted by $W$. If $n$ 
does not exceed 3, then the sample points and the sample space are easily 
interpreted in the space of the appropriate number of dimensions and are 
easy to visualize. If $n$ is greater than 3, diagrammatic presentation is 
impossible but it is still convenient to speak in terms of points and spaces. 

We shall now indicate how the search for confidence limits is reduced to the 
search for certain regions in the sample space, called regions of acceptance. 
For this purpose assume for a moment that the confidence limits $\underline{\theta}$ and $\bar{\theta}$ 
have already been found and correspond to a confidence coefficient $\alpha$, 
previously selected, $0 < \alpha < 1$. Consider a space $G$ (general space) of $n + 1$ 
dimensions. Of the $n + 1$ axes of coordinates in this space, the first $n$ will 
correspond to the $n$ observable random variables. In other words, the possible 
values of $X_1$ will be measured on axis $Ox_1$, the possible values of $X_2$ will be 
measured on axis $Ox_2$, etc. On the last (the $(n + 1)$-st) axis of 
coordinates in $G$, we shall measure the possible values of the estimated parameter 
$\theta_1$. We shall imagine that this last axis $O\theta_1$ is vertical. Now, select any 
possible sample point $(x_1, x_2, \ldots, x_n)$. To this point, there will correspond a 
value $\underline{\theta}(x_1, x_2, \ldots, x_n)$ of the lower confidence limit and a value 
$\bar{\theta}(x_1, x_2, \ldots, x_n)$ of the upper confidence limit. Imagine that we plot the two points, 
$[x_1, x_2, \ldots, x_n, \underline{\theta}(x_1, x_2, \ldots, x_n)]$ and $[x_1, x_2, \ldots, x_n, \bar{\theta}(x_1, x_2, \ldots, x_n)]$, and 
connect them by a line. This line or, rather, this interval of line, call it 
$\delta(x_1, x_2, \ldots, x_n)$, will represent the confidence interval corresponding to the 
selected possible sample point. The situation is illustrated in Figure 1. 
Imagine that this procedure is repeated for each and every possible sample 
point. Now, take a possible value of $\theta_1$, say $\theta_1'$, and, in the general space $G$, 
consider a horizontal plane $\theta_1 = \theta_1'$. Generally, this plane will cut some of the 
confidence intervals and will miss others. 

FIGURE 1. General space and confidence intervals 

Denote by $A(\theta_1')$ the set of all possible sample points such that the 
corresponding confidence intervals are cut by the plane $\theta_1 = \theta_1'$. In other words, 
the set $A(\theta_1')$ is the set of all possible sample points that satisfy the double 
condition, 

$$\underline{\theta}(x_1, x_2, \ldots, x_n) \le \theta_1' \le \bar{\theta}(x_1, x_2, \ldots, x_n). \tag{3}$$

The set $A(\theta_1')$ so defined is called the region of acceptance corresponding to 
$\theta_1'$. Thus, if $\underline{\theta}$ and $\bar{\theta}$ are confidence limits for $\theta_1$ corresponding to the 
confidence coefficient $\alpha$, then they determine a set, say $A$, of regions of acceptance. 
Obviously, to each possible value of $\theta_1$ there corresponds a region of 
acceptance. 

The regions of acceptance $A(\theta_1)$ and their set $A$ possess the following 
properties. The important property of every particular region of acceptance, say 
$A(\theta_1')$, is that if a possible sample point $(x_1, x_2, \ldots, x_n)$ falls within $A(\theta_1')$, 
then the corresponding confidence interval $\delta(x_1, x_2, \ldots, x_n)$ covers the value 
$\theta_1'$ of $\theta_1$ and vice versa. This is an immediate consequence of the definition 
of $A(\theta_1')$ by means of the double relation (3). In fact, in order to verify 
whether or not a sample point $(x_1', x_2', \ldots, x_n')$ falls within $A(\theta_1')$, it is 
sufficient to compute the values of $\underline{\theta}$ and $\bar{\theta}$ corresponding to this point and 
see whether or not they bracket $\theta_1'$. But this is exactly what we would do 
in order to verify whether or not $\delta(x_1', x_2', \ldots, x_n')$ covers $\theta_1'$. In order to 
express this by a formula, we shall agree to use the letter $C$ to denote the word 
“covers” and the symbol $\in$ to denote the phrase “is an element of” or “belongs 
to.” With this notation, we may write the identity of the two events, 

$$[E \in A(\theta_1')] \equiv [\delta(E) \; C \; \theta_1'], \tag{4}$$

where, as formerly, the letter $E$ stands for the set of observable random 
variables $X_1, X_2, \ldots, X_n$. 

It follows from (4) that, whatever be the assumptions on which the 
probabilities are computed, 

$$P\{E \in A(\theta_1')\} = P\{\delta(E) \; C \; \theta_1'\}. \tag{5}$$

In particular, if we compute the probabilities on the assumption that $\theta_1 = \theta_1'$ 
while the other parameters $\theta_2, \theta_3, \ldots, \theta_s$ have arbitrary values, we shall find 

$$P\{E \in A(\theta_1') \mid \theta_1', \theta_2, \ldots, \theta_s\} = P\{\delta(E) \; C \; \theta_1' \mid \theta_1', \theta_2, \ldots, \theta_s\} \equiv \alpha,$$

because of the definition of confidence intervals. Thus, if $\theta_1'$ is a possible 
value of $\theta_1$ and $A(\theta_1')$ is the corresponding region of acceptance, then, 
whatever be the possible values of $\theta_2, \theta_3, \ldots, \theta_s$, 

$$P\{E \in A(\theta_1') \mid \theta_1', \theta_2, \ldots, \theta_s\} \equiv \alpha. \tag{6}$$

Identity (6) represents the necessary condition which a region $A(\theta_1')$ must 
satisfy in order to qualify as a region of acceptance corresponding to the 
value $\theta_1'$ of the parameter $\theta_1$. In addition to this condition which applies 
to each and every region of acceptance $A(\theta_1')$ taken separately, there are 
important conditions which apply to the whole set $A$ of regions of acceptance. 
These conditions are intuitive and, therefore, I am going to enumerate them 
without giving proofs. The proofs are given in detail in my paper of 1937, 
already quoted. 

(I) The first condition that the set $A$ of regions of acceptance must satisfy 
is that the union of all regions of acceptance, corresponding to all possible 
values of the parameter $\theta_1$, must coincide with the sample space $W$. In 
other words, whatever possible sample point we take, say $(x_1, x_2, \ldots, x_n)$, 
there must exist at least one possible value of $\theta_1$ such that its region of 
acceptance includes this particular sample point. 

(II) The second necessary condition which the set $A$ must satisfy is as 
follows. Let $(x_1', x_2', \ldots, x_n')$ be an arbitrary possible sample point. 
According to the above condition (I), there exists at least one possible value 
$\theta_1'$ of $\theta_1$ such that $A(\theta_1')$ contains $(x_1', x_2', \ldots, x_n')$. Consider the set 
$S(x_1', x_2', \ldots, x_n')$ of all possible values of $\theta_1$ such that the point 
$(x_1', x_2', \ldots, x_n')$ is contained within their regions of acceptance. Then condition 
(II) states that the set $S(x_1', x_2', \ldots, x_n')$ fills a closed interval. This closed 
interval extends from $\underline{\theta}(x_1', x_2', \ldots, x_n')$ to $\bar{\theta}(x_1', x_2', \ldots, x_n')$. 

It is easy to see that condition (6), applying to each region $A(\theta_1')$ separately, 
and conditions (I) and (II), applying to the set $A$ of all regions $A(\theta_1')$, are 
necessary and sufficient for the set $A$ to be the set of regions of acceptance. 
In other words, if we start with defining for each possible value $\theta_1'$ of $\theta_1$ a 
region $A(\theta_1')$ satisfying (6) and if we manage to adjust these regions so that 
their set $A$ satisfies conditions (I) and (II), then this set $A$ determines 
confidence intervals for $\theta_1$ corresponding to the confidence coefficient $\alpha$. In fact, 
suppose that (6) and (I) and (II) are satisfied by some regions, say $B(\theta_1')$. 
Let $(x_1, x_2, \ldots, x_n)$ be a possible sample point. According to (I) and (II), 
there exists a closed interval of possible values of $\theta_1$, extending from some 
value $f_1(x_1, x_2, \ldots, x_n)$ to some other value $f_2(x_1, x_2, \ldots, x_n)$ such that, 
whatever $\theta_1''$ between the limits, $f_1 \le \theta_1'' \le f_2$, the point $(x_1, x_2, \ldots, x_n)$ belongs 
to the region $B(\theta_1'')$. You will have no difficulty in verifying that the two 
functions $f_1$ and $f_2$ satisfy the definition of confidence limits for $\theta_1$. In fact, 
the interval between them, say $\Delta(x_1, x_2, \ldots, x_n)$, covers any given value 
$\theta_1'$ of $\theta_1$ whenever $(x_1, x_2, \ldots, x_n)$ belongs to $B(\theta_1')$ and in no other case. 
On the other hand, since $B(\theta_1')$ is supposed to satisfy condition (6), we have 

$$P\{E \in B(\theta_1') \mid \theta_1', \theta_2, \ldots, \theta_s\} \equiv \alpha.$$

Thus, 

$$P\{\Delta(E) \; C \; \theta_1' \mid \theta_1', \theta_2, \ldots, \theta_s\} = P\{E \in B(\theta_1') \mid \theta_1', \theta_2, \ldots, \theta_s\} \equiv \alpha.$$

In this way we come to the conclusion that, in order to determine a pair 
of confidence limits, it is both necessary and sufficient to determine a 
family $A$ of regions of acceptance satisfying the above three conditions. 

If the number $s$ of unknown parameters involved in the probability density 
function of the observable random variables exceeds unity, there are 
substantial difficulties in determining regions satisfying condition (6). Regions 
satisfying this condition are called similar to the sample space and there is 
a substantial literature concerning them.⁵ On the other hand, if $s = 1$, the 
problem of satisfying condition (6) is trivial. Conditions (I) and (II) are 
also easy to satisfy and, generally, we have at our disposal a great variety 
of confidence intervals corresponding to the same confidence coefficient. In 
the simplest case, $s = 1$, the theory of confidence intervals is concerned 
mainly with the problem of an appropriate choice among the many 
confidence intervals. Naturally, in the case $s > 1$, the problem of choice also 
exists and presents more difficulties. Limitations of time and space make 
it impossible to discuss all of these matters in detail and I must refer you 
to the literature already quoted. The best that I can do here is to work 
out an example in order to illustrate the procedure of determining confidence 
intervals. In so doing, we shall have occasion to discuss the interpretation 
of confidence intervals and also to touch upon the problem of optimum. 

The example I shall use is the one discussed yesterday. This is the 
example of $n$ observable random variables $X_1, X_2, \ldots, X_n$, all independent 
and having the same distribution with the probability density function equal 
to $1/\theta$ for $0 < x \le \theta$ and zero elsewhere, $\theta > 0$ being the parameter to be 
estimated. Yesterday I considered the particular case in which some 
definite information regarding $\theta$ was available. Namely, I assumed as known 
for certain that (a) $\theta$ cannot exceed the limits $0 < \theta \le 1$ and (b) the 
frequency of cases where $\theta$ falls within any interval $(a, b)$ partial to $(0, 1)$ 
is represented by the integral, 

$$\int_a^b m\theta^{m-1}\,d\theta = b^m - a^m,$$

with a known value of $m$. In other words, I assumed in discussing this 
example that the a priori distribution of $\theta$ was known exactly. 

5 See, for example, the following papers: 

(a) J. Neyman and E. S. Pearson: “On the problem of the most efficient tests of 
statistical hypotheses.” Phil. Trans. Roy. Soc., London, Ser. A, Vol. 231 (1933), pp. 
289-337. 

(b) W. Feller: “Note on regions similar to the sample space.” Stat. Research Memoirs, 
Vol. 2 (1938), pp. 117-125. 

(c) J. Neyman: “On a statistical problem arising in routine analysis and in sampling 
inspection of mass production.” Annals of Math. Stat., Vol. 12 (1941), pp. 46-76. 

(d) H. Scheffé: “On the theory of testing composite hypotheses with one constraint.” 
Annals of Math. Stat., Vol. 13 (1942), pp. 280-293. 

(e) E. L. Lehmann: “On optimum tests of composite hypotheses with one constraint.” 
Annals of Math. Stat., Vol. 18 (1947), pp. 473-493. 

(f) P. G. Hoel: “On the uniqueness of similar regions.” Annals of Math. Stat., Vol. 19 
(1948), pp. 66-71. 

Today, contrary to this, I shall study the problem of estimating $\theta$ in 
conditions where nothing whatever is known about its value except that it 
is a positive number. Of course, this unique assumption cannot be 
considered as any sort of limitation to the generality of the problem, since the 
assumption $\theta > 0$ is implied by the nature of the assumed distribution of 
the observable random variables, with their probability density function 
positive within the interval $(0, \theta)$ and zero elsewhere. 

Thus, the difference in the conditions of the problem of estimation 
considered yesterday and this morning is as follows: Yesterday I assumed 
definite information regarding the distribution of the observable random 
variables given the value of the parameter $\theta$ and definite information 
regarding the distribution of $\theta$ considered as a random variable; this morning 
I shall treat the problem of estimating $\theta$ when no assumptions are made 
regarding the a priori distribution of $\theta$ except the one implied by the 
information about the distribution of the observable random variables. Our 
problem will be to construct a system of confidence intervals for $\theta$ 
corresponding to a preassigned confidence coefficient $\alpha$, say $\alpha = .90$, $\alpha = .95$, etc. 

Turning to the general theory outlined above, we must be clear about our 
aims and about the steps we have to take to attain these aims. Our aim is to 
define over the whole sample space $W$ of the observable random variables two 
functions $\underline{\theta}(X_1, X_2, \ldots, X_n)$ and $\bar{\theta}(X_1, X_2, \ldots, X_n)$ having the property that, 
whatever be the (necessarily positive) value $\theta_1$ of the parameter $\theta$, the 
probability that this value $\theta_1$ will be bracketed by $\underline{\theta}$ and $\bar{\theta}$ is equal to $\alpha$, 

$$P\{\underline{\theta}(X_1, X_2, \ldots, X_n) \le \theta_1 \le \bar{\theta}(X_1, X_2, \ldots, X_n) \mid \theta = \theta_1\} = \alpha. \tag{7}$$

This, then, is our aim. The means to attain this aim, as indicated by the 
foregoing theory, is to take the following steps: 

(i) To determine the sample space $W$; 

(ii) For each possible value $\theta_1$ of $\theta$, i.e. for each positive number $\theta_1$, to select 
within $W$ a region of acceptance $A(\theta_1)$ satisfying condition (6) and such that 
the totality $A$ of such regions satisfy conditions (I) and (II). 

Then the boundaries of the set $S(x_1, x_2, \ldots, x_n)$ will represent the values of 
the functions $\underline{\theta}$ and $\bar{\theta}$ corresponding to the possible sample point 
$(x_1, x_2, \ldots, x_n)$. 


I have already pointed out that in many cases not one but many 
different systems of regions of acceptance are available. Naturally, each 
system of regions of acceptance determines a separate system of confidence 
intervals corresponding to the same confidence coefficient $\alpha$. In order to 
illustrate this point, we shall select two systems of regions of acceptance, 
$A$ and $B$, and examine the corresponding confidence intervals. 

According to the conditions of the problem, the probability density 
function of the $n$ observable random variables, $X_1, X_2, \ldots, X_n$, is given by 

$$p_E(x_1, x_2, \ldots, x_n \mid \theta) = \begin{cases} 1/\theta^n & \text{for } 0 < x_1, x_2, \ldots, x_n \le \theta, \\ 0 & \text{elsewhere,} \end{cases} \tag{8}$$

where $\theta$ is some unknown positive number. Thus, function (8) is positive 
within a hypercube of $n$ dimensions, with each dimension extending from 
zero to $\theta$. Since $\theta > 0$ may be any number, every point with positive 
coordinates $x_1, x_2, \ldots, x_n$ is a possible sample point. It follows that, in this 
case, the sample space $W$ is the set of all points in $n$ dimensions with their 
coordinates $x_i > 0$, for $i = 1, 2, \ldots, n$. This statement completes step (i). 
Now, let us proceed to step (ii) and determine the system $A$ of regions 
of acceptance. For this purpose, fix for a moment an arbitrary possible 
value $\theta_1$ of $\theta$, and denote by $W(\theta_1)$ the region partial to $W$ determined by 
the inequalities, 

$$0 < x_i \le \theta_1, \quad \text{for } i = 1, 2, \ldots, n. \tag{9}$$

Should it happen that $\theta_1 > 0$ is the true value of $\theta$, then within $W(\theta_1)$ the 
probability density function (8) of the observable random variables is positive 
and equal to $1/\theta_1^n$, while outside of $W(\theta_1)$ it is zero. With the notation 
adopted, the symbol $W[\theta_1(1 - \alpha)^{1/n}]$ denotes a hypercube partial to $W(\theta_1)$ 
with dimensions, 

$$0 < x_i \le \theta_1(1 - \alpha)^{1/n}, \quad \text{for } i = 1, 2, \ldots, n. \tag{10}$$

As the region of acceptance, $A(\theta_1)$, corresponding to the selected possible 
value $\theta_1$ of $\theta$, we shall select that part of the hypercube $W(\theta_1)$ which lies 
outside of $W[\theta_1(1 - \alpha)^{1/n}]$, with the inclusion of the outer boundary of the 
latter. In other words, $A(\theta_1)$ is defined to include every point $(x_1, x_2, \ldots, x_n)$ 
which satisfies condition (9) but fails to satisfy the condition 

$$0 < x_i < \theta_1(1 - \alpha)^{1/n}, \quad \text{for } i = 1, 2, \ldots, n. \tag{11}$$

You will notice that (11) differs from (10) by the lack of the equality sign 
on the right. 

It is easy to see that region $A(\theta_1)$ satisfies condition (6). To see 
this, assume that $\theta_1$ is the true value of $\theta$ and compute the probability 
$P\{E \in A(\theta_1) \mid \theta_1\}$. Owing to the fact that the distribution of $E$ within $W(\theta_1)$ 
is uniform on the assumption just made, the probability $P\{E \in A(\theta_1) \mid \theta_1\}$ is 
equal to the volume of $A(\theta_1)$ divided by the volume of $W(\theta_1)$. According 
to the definition of $A(\theta_1)$, 

$$\begin{aligned}
\text{Volume of } A(\theta_1) &= \text{Volume of } W(\theta_1) - \text{Volume of } W[\theta_1(1 - \alpha)^{1/n}] \\
&= \theta_1^n - [\theta_1(1 - \alpha)^{1/n}]^n \\
&= \alpha\theta_1^n,
\end{aligned}$$

and it follows that 

$$P\{E \in A(\theta_1) \mid \theta_1\} \equiv \alpha$$

irrespective of the value $\theta_1$ of $\theta$. 
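
The volume argument above can be checked numerically. The following is only a minimal sketch: the function names and the values of $\theta_1$, $\alpha$ and $n$ are illustrative assumptions, not part of the lecture. It computes the ratio of volumes directly and also estimates the same probability by drawing sample points uniformly within $W(\theta_1)$.

```python
import random

def in_region_A(xs, theta1, alpha):
    """True if the sample point xs satisfies (9) but fails to satisfy (11)."""
    n = len(xs)
    inner_edge = theta1 * (1.0 - alpha) ** (1.0 / n)
    inside_outer = all(0.0 < x <= theta1 for x in xs)      # condition (9)
    inside_inner = all(0.0 < x < inner_edge for x in xs)   # condition (11)
    return inside_outer and not inside_inner

def exact_probability(theta1, alpha, n):
    """Volume of A(theta1) divided by the volume of W(theta1)."""
    outer = theta1 ** n
    inner = (theta1 * (1.0 - alpha) ** (1.0 / n)) ** n
    return (outer - inner) / outer

def simulated_probability(theta1, alpha, n, trials=20000, seed=0):
    """Relative frequency of sample points falling in A(theta1) when
    theta1 is the true value (pseudo-random draws, fixed seed)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.uniform(0.0, theta1) for _ in range(n)]
        hits += in_region_A(xs, theta1, alpha)
    return hits / trials

# Illustrative values only; any theta1 > 0 should give a ratio close to alpha.
theta1, alpha, n = 2.5, 0.90, 4
print(exact_probability(theta1, alpha, n))
print(simulated_probability(theta1, alpha, n))
```

Changing `theta1` in the last lines should leave both printed numbers near `alpha`, which is the content of the identity sign in condition (6).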

Thus, for every possible value $\theta_1$ of $\theta$, we have defined a region $A(\theta_1)$ 
satisfying condition (6). In order to determine whether or not these regions 
can be regarded as regions of acceptance, we shall consider the set $A$ of all 
these regions and verify whether or not it satisfies conditions (I) and (II). 
For this purpose it will be convenient to give the definition of $A(\theta)$ an 
analytical form. You will notice that, in order to determine whether a 
given point $(x_1, x_2, \ldots, x_n)$ with positive coordinates falls within $A(\theta)$ or 
not, it is not necessary to know the values of all of its coordinates. For this 
particular purpose, it is sufficient to know the value $x$ of the greatest of 
the $n$ coordinates $x_1, x_2, \ldots, x_n$. If 

$$\theta(1 - \alpha)^{1/n} \le x \le \theta, \tag{12}$$

then the point $(x_1, x_2, \ldots, x_n)$ belongs to $A(\theta)$. Otherwise, it does not. 
Thus, the double formula (12) represents the complete definition of the 
region $A(\theta)$ corresponding to any specified $\theta > 0$. 

In order to verify that the set $A$ satisfies conditions (I) and (II), fix an 
arbitrary possible sample point, i.e., a point with arbitrary positive 
coordinates $x_1, x_2, \ldots, x_n$, and determine the set $S(x_1, x_2, \ldots, x_n)$ of values of $\theta$ 
for which this point $(x_1, x_2, \ldots, x_n) \in A(\theta)$. As previously, let $x$ denote the 
greatest of the coordinates of the selected possible sample point. In order 
that this point belong to $A(\theta)$, it is necessary and sufficient that $\theta$ satisfy 
the double condition (12). Solving this condition for $\theta$, we obtain the 
double condition 

$$x \le \theta \le \frac{x}{(1 - \alpha)^{1/n}}, \tag{13}$$

which defines the set $S(x_1, x_2, \ldots, x_n)$. It is seen that the set 
$S(x_1, x_2, \ldots, x_n)$ extends over a closed interval beginning with $x$ and ending with 
$x/(1 - \alpha)^{1/n}$. 

It follows that the set $A$ of regions $A(\theta)$ satisfies conditions (I) and (II) 
and that the corresponding confidence limits for $\theta$ are 

$$\underline{\theta}(x_1, x_2, \ldots, x_n) = x,$$

and 

$$\bar{\theta}(x_1, x_2, \ldots, x_n) = \frac{x}{(1 - \alpha)^{1/n}}.$$

The length of the confidence interval corresponding to the given point 
$(x_1, x_2, \ldots, x_n)$ is, say, 

$$\delta(x_1, x_2, \ldots, x_n) = x\,\frac{1 - (1 - \alpha)^{1/n}}{(1 - \alpha)^{1/n}}.$$
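
The passage from regions of acceptance to confidence limits can also be traced numerically. The sketch below, under assumed illustrative values of $x$, $\alpha$ and $n$ and with hypothetical function names, scans a grid of values of $\theta$, keeps those whose region $A(\theta)$ contains the sample point according to (12), and compares the ends of the resulting set with the closed-form limits $x$ and $x/(1 - \alpha)^{1/n}$.

```python
def in_region_A(x_max, theta, alpha, n):
    """Condition (12): theta * (1 - alpha)**(1/n) <= x_max <= theta."""
    return theta * (1.0 - alpha) ** (1.0 / n) <= x_max <= theta

def accepted_thetas(x_max, alpha, n, grid):
    """Grid values of theta whose region of acceptance contains the point."""
    return [t for t in grid if in_region_A(x_max, t, alpha, n)]

# Illustrative values only: greatest coordinate 0.8, alpha = .95, n = 12.
x_max, alpha, n = 0.8, 0.95, 12
grid = [k / 10000 for k in range(1, 20001)]   # theta from 0.0001 to 2.0
S = accepted_thetas(x_max, alpha, n, grid)

lower_grid, upper_grid = S[0], S[-1]
lower_exact = x_max
upper_exact = x_max / (1.0 - alpha) ** (1.0 / n)
length_exact = x_max * (1.0 - (1.0 - alpha) ** (1.0 / n)) / (1.0 - alpha) ** (1.0 / n)

print(lower_grid, upper_grid)      # ends of the accepted set on the grid
print(lower_exact, upper_exact)    # closed-form confidence limits
print(length_exact)                # closed-form length of the interval
```

On the grid the accepted values form one unbroken run, which is what condition (II) requires; the grid ends agree with the closed-form limits up to the grid step.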


Let $X$ be defined as the random variable whose value coincides with that of 
the greatest of the random variables $X_1, X_2, \ldots, X_n$. Then the substitution 
in $\underline{\theta}$ and $\bar{\theta}$ of $X_i$ instead of $x_i$ (for $i = 1, 2, \ldots, n$) and of $X$ instead of $x$, will 
yield two random variables, 

$$\underline{\theta}(X_1, X_2, \ldots, X_n) = X, \qquad \bar{\theta}(X_1, X_2, \ldots, X_n) = \frac{X}{(1 - \alpha)^{1/n}}, \tag{14}$$

which have property (7) for all values of $\theta > 0$. In other words, whatever 
may be the true value of $\theta > 0$, the probability that the two functions (14) 
will bracket this value is exactly equal to $\alpha$. Thus, the practical statistician 
who makes a rule of asserting that $\theta$ is a number between the particular values 
of $\underline{\theta}$ and $\bar{\theta}$ as determined by observation is in a position exactly comparable 
to that of a gambler betting on the outcome of a game of chance, the 
probability of which is equal to $\alpha$ which may be as close to unity as desired. 

We shall use the symbol $\delta(E)$ to denote the confidence interval determined 
by the two functions (14) and built by using system $A$ of regions of acceptance. 
Now, we shall proceed to define an alternative system $B$ of regions of 
acceptance and the corresponding confidence interval, say $\Delta(E)$. 

Fix a tentative positive value $\theta_1$ of $\theta$ and define $B(\theta_1)$ to include all points 
of the sample space $W$ such that the arithmetic mean $\bar{x}$ of their coordinates 
differs from $\theta_1/2$ by not more than a quantity $u(\theta_1)$. Thus, $B(\theta_1)$ is defined by 
the double relation, 

$$\tfrac{1}{2}\theta_1 - u(\theta_1) \le \bar{x} \le \tfrac{1}{2}\theta_1 + u(\theta_1). \tag{15}$$

Let $\bar{X}$ represent the random variable defined as the arithmetic mean of 
$X_1, X_2, \ldots, X_n$. The motivation for the above choice of the region $B(\theta_1)$ is 
that $\theta_1/2$ represents the expected value of $\bar{X}$ and also its most probable value. 
The quantity $u(\theta_1)$ must be so determined as to satisfy condition (6). Since 
a given sample point does or does not belong to $B(\theta_1)$ according as $\bar{X}$ does or 
does not satisfy condition (15), it is obvious that 

$$P\{E \in B(\theta_1) \mid \theta_1\} = P\{|\bar{X} - \tfrac{1}{2}\theta_1| \le u(\theta_1) \mid \theta_1\}.$$

It follows that the value of $u(\theta_1)$ can be found by using the probability density 
function of $\bar{X}$ computed on the assumption that $\theta_1$ is the true value of $\theta$. 
The exact form of this probability density function was found by Laplace. 
If $n = 2$, then it is very simple, 





















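$$p_{\bar{X}}(\bar{x} \mid \theta_1) = \begin{cases} 4\bar{x}/\theta_1^2 & \text{for } 0 \le \bar{x} \le \tfrac{1}{2}\theta_1, \\ 4(\theta_1 - \bar{x})/\theta_1^2 & \text{for } \tfrac{1}{2}\theta_1 \le \bar{x} \le \theta_1, \\ 0 & \text{elsewhere.} \end{cases}$$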
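
As a quick plausibility check of the density for $n = 2$, the sketch below codes the triangular form and verifies by a midpoint rule that it integrates to unity and has mean $\theta_1/2$. The helper names and the value $\theta_1 = 3$ are assumptions made for illustration only.

```python
def density_mean_of_two(x_bar, theta1):
    """Density of the mean of two independent variables, each uniform on (0, theta1)."""
    if 0.0 <= x_bar <= theta1 / 2.0:
        return 4.0 * x_bar / theta1 ** 2
    if theta1 / 2.0 < x_bar <= theta1:
        return 4.0 * (theta1 - x_bar) / theta1 ** 2
    return 0.0

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

theta1 = 3.0   # illustrative value only
total = integrate(lambda t: density_mean_of_two(t, theta1), 0.0, theta1)
mean = integrate(lambda t: t * density_mean_of_two(t, theta1), 0.0, theta1)
print(total, mean)   # should be close to 1 and to theta1 / 2
```

The peak of the triangle sits at $\theta_1/2$, in agreement with the remark that $\theta_1/2$ is both the expected and the most probable value of the mean.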


On the other hand, as $n$ is increased, the expression for $p_{\bar{X}}(\bar{x} \mid \theta_1)$ becomes 
more and more complicated and soon becomes unmanageable. It happens, 
however, that with very moderate values of $n$, say with $n = 5$, the probabilities 
computed using the true probability density function are already difficult to 
distinguish from their normal approximations. Since our purpose here is to 
deal with the conceptual, rather than with the numerical side of the problem, 
we shall use the normal approximation of the probability density of $\bar{X}$. Using 
the fact that, for each observable random variable $X_i$, 

$$E(X_i \mid \theta_1) = \tfrac{1}{2}\theta_1, \qquad E(X_i^2 \mid \theta_1) = \tfrac{1}{3}\theta_1^2,$$

we find that 

$$\sigma_{X_i}^2 = \tfrac{1}{12}\theta_1^2$$

and it follows that 

$$E(\bar{X} \mid \theta_1) = \tfrac{1}{2}\theta_1, \qquad \sigma_{\bar{X}} = \frac{\theta_1}{\sqrt{12n}}.$$

The normal approximation to the probability density function $p_{\bar{X}}(\bar{x} \mid \theta_1)$ is, 
then, say 

$$p^*_{\bar{X}}(\bar{x} \mid \theta_1) = \frac{\sqrt{12n}}{\theta_1\sqrt{2\pi}}\, e^{-12n(\bar{x} - \theta_1/2)^2/2\theta_1^2}$$

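and the probability that $\bar{X}$ will differ from $\theta_1/2$ by not more than $u(\theta_1)$ is 
approximately equal to, say, 

$$P^*\{|\bar{X} - \tfrac{1}{2}\theta_1| \le u(\theta_1)\} = 2\int_{\theta_1/2}^{\theta_1/2 + u(\theta_1)} p^*_{\bar{X}}(\bar{x} \mid \theta_1)\,d\bar{x}. \tag{16}$$

After some easy algebra, this formula reduces to 

$$P^*\{|\bar{X} - \tfrac{1}{2}\theta_1| \le u(\theta_1)\} = \frac{2}{\sqrt{2\pi}}\int_0^{\lambda(\alpha)} e^{-t^2/2}\,dt,$$

where 

$$\lambda(\alpha) = \frac{u(\theta_1)}{\theta_1}\sqrt{12n}. \tag{17}$$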


The requirement that the probability (16) equal the preassigned value $\alpha$, 
less than unity, determines uniquely the value of $\lambda(\alpha)$ which can be found in 
any table of the normal integral. For example, if $\alpha = .95$, then $\lambda(\alpha) = 1.96$, 
etc. Now equation (17) determines $u(\theta_1)$, namely, 

$$u(\theta_1) = \theta_1\,\frac{\lambda(\alpha)}{\sqrt{12n}}$$

and it follows that, if we grant the normal approximation, region $B(\theta_1)$ is 
defined by the double relation, 

$$\theta_1\left(\frac{1}{2} - \frac{\lambda(\alpha)}{\sqrt{12n}}\right) \le \bar{x} \le \theta_1\left(\frac{1}{2} + \frac{\lambda(\alpha)}{\sqrt{12n}}\right). \tag{18}$$
The region so defined will satisfy, approximately, condition (6). Denote 
by $B$ the set of all regions $B(\theta_1)$ corresponding to all possible values of $\theta_1$. 
We shall now consider whether or not the set $B$ satisfies conditions (I) and 
(II). For this purpose, we fix an arbitrary sample point $(x_1, x_2, \ldots, x_n)$ 
and seek the set, say $S_B(x_1, x_2, \ldots, x_n)$, of those values of $\theta$ for which 
$(x_1, x_2, \ldots, x_n) \in B(\theta)$. Using definition (18) of the region $B(\theta)$, we find that, 
for $(x_1, x_2, \ldots, x_n) \in B(\theta)$, it is both necessary and sufficient that 

$$\frac{\bar{x}}{\dfrac{1}{2} + \dfrac{\lambda(\alpha)}{\sqrt{12n}}} \le \theta \le \frac{\bar{x}}{\dfrac{1}{2} - \dfrac{\lambda(\alpha)}{\sqrt{12n}}}. \tag{19}$$

Just as in the case of set $A$, it is seen that the set $S_B(x_1, x_2, \ldots, x_n)$ covers a 
closed interval (19). It follows that $B$ is a set of regions of acceptance and 
that the corresponding confidence limits are, say, 

$$\underline{\theta}(X_1, X_2, \ldots, X_n) = \frac{\bar{X}}{\dfrac{1}{2} + \dfrac{\lambda(\alpha)}{\sqrt{12n}}}, \qquad \bar{\theta}(X_1, X_2, \ldots, X_n) = \frac{\bar{X}}{\dfrac{1}{2} - \dfrac{\lambda(\alpha)}{\sqrt{12n}}}. \tag{20}$$

The interval between these two limits is the confidence interval $\Delta(E)$ 
corresponding to the confidence coefficient $\alpha$. 
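
Formulae (20) translate directly into a short routine. The following is a sketch only: the function names are hypothetical, the value of $\lambda(\alpha)$ is taken from Python's `statistics.NormalDist` in place of a printed table of the normal integral, and the guard against a non-positive denominator is an added assumption, since (19) presupposes that $\tfrac{1}{2} - \lambda(\alpha)/\sqrt{12n}$ is positive.

```python
from math import sqrt
from statistics import NormalDist, fmean

def lambda_of(alpha):
    """Value lambda with P{|Z| <= lambda} = alpha for a standard normal Z."""
    return NormalDist().inv_cdf((1.0 + alpha) / 2.0)

def limits_B(xs, alpha):
    """Lower and upper confidence limits (20), based on the sample mean
    and on the normal approximation to its distribution."""
    n = len(xs)
    x_bar = fmean(xs)
    half = lambda_of(alpha) / sqrt(12.0 * n)
    if half >= 0.5:
        # Assumption of this sketch: refuse samples too small for (19) to make sense.
        raise ValueError("n is too small for this alpha: denominator not positive")
    return x_bar / (0.5 + half), x_bar / (0.5 - half)

# Illustrative sample of twelve values with mean 0.5.
sample = [0.5] * 12
lower, upper = limits_B(sample, 0.95)
print(lambda_of(0.95))
print(lower, upper)
```

Since the limits rest on the normal approximation, they satisfy condition (6) only approximately, as stated in the text.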

In order to bring out the delicate points of interpretation of formulae (14) 
and (20), we will use numerical examples. Thus, we shall select $\alpha = .95$ 
and substitute $n = 12$ [this particular value was selected in order to have less 
trouble with the square root of $12n$ in (20)]. Then formulae (14) and (20) 
reduce to 

$$\underline{\theta} = X, \qquad \bar{\theta} = (1.284)X, \qquad \delta(E) = (0.284)X, \tag{21}$$

and 

$$\underline{\theta} = (1.508)\bar{X}, \qquad \bar{\theta} = (2.970)\bar{X}, \qquad \Delta(E) = (1.462)\bar{X}, \tag{22}$$

respectively. In addition to the confidence limits, the lengths of the 
confidence intervals are also given. 
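
The numerical coefficients in (21) and (22) can be recomputed in a few lines. This is a sketch under the values stated in the text ($\alpha = .95$, $n = 12$, $\lambda(\alpha) = 1.96$); the variable names are illustrative.

```python
from math import sqrt

alpha, n = 0.95, 12
lam = 1.96                         # tabled value of lambda(alpha) quoted in the text

# System A, formulae (14): limits X and X / (1 - alpha)**(1/n).
upper_A = 1.0 / (1.0 - alpha) ** (1.0 / n)
length_A = upper_A - 1.0

# System B, formulae (20): limits are multiples of the sample mean.
lower_B = 1.0 / (0.5 + lam / sqrt(12 * n))
upper_B = 1.0 / (0.5 - lam / sqrt(12 * n))
length_B = upper_B - lower_B

print(f"A: upper {upper_A:.3f}, length {length_A:.3f}")
print(f"B: lower {lower_B:.3f}, upper {upper_B:.3f}, length {length_B:.4f}")
```

The limits agree with (21) and (22) to three decimals. The unrounded length for system $B$ comes out near 1.4628, so the figure 1.462 in (22) appears to be the difference of the two rounded limits.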

The correct theoretical interpretations of formulae (21) and (22) are 
as follows. 

Theoretical interpretation. If the twelve observable random variables 
$X_1, X_2, \ldots, X_{12}$ are completely independent and if each of them follows a 
uniform distribution between zero and $\theta > 0$, then, whatever the actual 
value $\theta_1$ of $\theta$ may be, the probability that the greatest $X$ of the $X_i$ will not 
exceed $\theta_1$ and, at the same time, that $(1.284)X$ will not be less than $\theta_1$ is 
equal to the preassigned number $\alpha = .95$, 

$$P\{X \le \theta_1 \le (1.284)X \mid \theta = \theta_1\} = \alpha = .95.$$

Similarly and under the same conditions, we have 

$$P\{(1.508)\bar{X} \le \theta_1 \le (2.970)\bar{X} \mid \theta = \theta_1\} = \alpha = .95.$$

From this probabilistic interpretation, we obtain the following operational 
interpretation. 

Operational interpretation. If the manner of obtaining the particular 
values of twelve variables $X_1, X_2, \ldots, X_{12}$ is such that the assumption of 
their complete independence and uniform distribution between zero and 
some positive number $\theta$ is satisfied with a satisfactory approximation, then 
the long-run relative frequency of cases where $X$ and $(1.284)X$ bracket $\theta$, 
and also of those where $(1.508)\bar{X}$ and $(2.970)\bar{X}$ bracket $\theta$, is approximately 
equal to $\alpha = .95$. 
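
The operational interpretation lends itself to a small simulation. The sketch below is an illustration under stated assumptions: pseudo-random numbers stand in for the actual observations, the function names are hypothetical, and the true values $\theta = 1, 2, 3$ are arbitrary. For each true value it draws many samples of twelve and records how often each pair of limits brackets $\theta$.

```python
import random
from math import sqrt

def limits_A(xs, alpha):
    """Formulae (14): limits based on the greatest observation."""
    x_max = max(xs)
    return x_max, x_max / (1.0 - alpha) ** (1.0 / len(xs))

def limits_B(xs, lam):
    """Formulae (20): limits based on the mean, with tabled value lam."""
    n = len(xs)
    x_bar = sum(xs) / n
    half = lam / sqrt(12.0 * n)
    return x_bar / (0.5 + half), x_bar / (0.5 - half)

def coverage(theta, n=12, alpha=0.95, lam=1.96, trials=20000, seed=0):
    """Relative frequencies with which systems A and B bracket the true theta."""
    rng = random.Random(seed)
    hits_A = hits_B = 0
    for _ in range(trials):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        lo, hi = limits_A(xs, alpha)
        hits_A += lo <= theta <= hi
        lo, hi = limits_B(xs, lam)
        hits_B += lo <= theta <= hi
    return hits_A / trials, hits_B / trials

for k, theta in enumerate((1.0, 2.0, 3.0)):
    freq_A, freq_B = coverage(theta, seed=k)
    print(f"theta = {theta}: system A {freq_A:.3f}, system B {freq_B:.3f}")
```

Both frequencies should settle near .95 whatever the true value of $\theta$; for system $B$ the agreement is only approximate, because its limits rest on the normal approximation.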

Practical use of confidence intervals. The above properties of confidence 
intervals were deduced from the specified assumptions regarding the 
observable random variables and, therefore, are the result of deductive reasoning. 
Having understood the meaning of these results, we may now decide (and 
this will be an act of will, not reasoning) to use these results in cases where 
it is desirable to have our actions adjusted to the value of $\theta$ which, 
unfortunately, is unknown. Our decision could be to behave as if it were known 
for certain that the true value of $\theta$ lies between the lower and the upper 
confidence limits computed from actual observations. The motivation 
behind this rule of behavior is simple: taking into account the operational 
interpretation of confidence intervals, we know that the long-run relative 
frequency of cases where our actions will be adjusted correctly, is equal to 
the number $\alpha$ which we have selected ourselves. 

You will remember that this is just the requirement from a method of 
estimation which a practical statistician may be reasonably expected to 
address to the theory. 

I wish to emphasize the circumstance that the use of confidence intervals 
involves the following phases: (i) formulation of the problem, (ii) deductive 
reasoning leading to the solution of this problem and (iii) an act of will 
to adjust our behavior in accordance with the values of the confidence limits. 
In the past, claims have been made frequently that statistical estimation 
involves some mental processes described as inductive reasoning. The 
foregoing analysis tends to indicate that in the ordinary procedure of statistical 
estimation there is no phase corresponding to the description of “inductive 
reasoning.” This applies equally to cases in which probabilities a priori 
are implied by the conditions of the problem and to cases in which they are 
not. In either case, all of the reasoning is deductive and leads to certain 
formulae and their properties. A new phase arrives when we decide to 
apply these formulae and to enjoy the consequences of their properties. This 
phase is marked by an act of will (not reasoning) and, therefore, if it is 
desired to use the adjective “inductive” in relation to methods of estimation, 
it should be used in connection with the noun “behavior” rather than 
“reasoning.” The concept of “inductive behavior” is discussed in some detail 
in a book⁶ in which it is treated as the motivational basis of the whole 
theory of statistics. 

The operational interpretation of formulae (21) and (22) can be easily
illustrated by a sampling experiment which you may wish to perform. In
this it is convenient to use one of the published tables of random numbers.⁷
As you know, ordinarily, tables of random numbers give groups of four
digits, each digit selected at random from 0, 1, 2, …, 9 with particular care
that the consecutive selections be independent. Each such group can be
considered as a decimal fraction with four digits. However, there is always
the possibility of leaving out a digit or two or of adding a few more digits
borrowed from the next group in the same line.

If we decide on a fixed number of digits, for example on three, then the 
consecutive groups in a column will produce an operational equivalent of 


6 J. Neyman: First Course in Probability and Statistics. Henry Holt and Co., New 
York, 1950, 350 pp. 

7 See, for example, the following tables: 

(a) L. H. C. Tippett: Random Sampling Numbers. Tracts for Computers, No. xv, 
The University Press, Cambridge (Eng.), 1927, viii + 26 pp. 

(b) M. G. Kendall and B. Babington Smith: Tables of Random Sampling Numbers.
Tracts for Computers, No. xxiv, The University Press, Cambridge (Eng.), 1939,
x + 60 pp.

(c) R. A. Fisher and Frank Yates: Statistical Tables for Biological, Agricultural and 
Medical Research. Oliver and Boyd, London, 1938, 90 pp. 


repeated independent observations of a random variable, say Y, which is
discrete and is capable of assuming all values from .000 to .999, differing
by .001. Moreover, all these particular values are approximately
equiprobable. Obviously, Y, so defined, may be taken as an excellent
approximation to the variable X distributed uniformly between zero and
unity. Thus, if we select θ = 1, the twelve independent “observations” of
X₁, X₂, …, X₁₂ can be read from any column of groups of three digits in
a table of random numbers.

However, we need not limit ourselves to the value θ = 1. In fact, you
will find it instructive to select for your sampling experiment a set of, say,
100 different values (quite arbitrary) of θ. For example, one may be θ₁ = 1,
another θ₂ = .5, still another θ₃ = 2, etc. In order to obtain the simulated
twelve observations of random variables uniformly distributed between
zero and θᵢ ≠ 1, it will be sufficient to take an appropriate number of digits
in the table, write them as though they formed a decimal fraction and then
multiply the result by θᵢ. Naturally, if you want all the “measurements”
of your “observable random variables” to be made with the same accuracy,
you will have to use one more digit for θᵢ = 10 than for θⱼ = 1, etc.
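
For readers without a printed table at hand, the reading of digit groups can be imitated by a pseudo-random generator. The following is only a minimal sketch: the function name `simulated_sample`, the seed and the use of Python's `random` module in place of a published table are all assumptions made for illustration.

```python
import random

def simulated_sample(theta, n=12, digits=3, rng=random):
    """Return n simulated observations, roughly uniform on (0, theta).

    Each observation imitates one group of `digits` random digits read as a
    decimal fraction (295 -> .295) and then multiplied by theta.
    """
    scale = 10 ** digits
    return [theta * rng.randrange(scale) / scale for _ in range(n)]

rng = random.Random(1938)  # arbitrary seed; stands in for a printed table
for theta in (1, 0.5, 2):
    print(theta, [round(x, 3) for x in simulated_sample(theta, rng=rng)])
```

Raising `digits` by one for θ = 10 keeps the absolute accuracy of the “measurements” the same as with three digits for θ = 1.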

Incidentally, I have just said that it would be instructive to embark on
a sampling experiment with 100 arbitrarily selected θ’s, all different.
However, I am quite sure that after the fourth or fifth θ you will become
convinced that the difference in the value of θ does not influence the relative
frequency with which the confidence intervals cover the true θ and that,
thereafter, you will use the simplest value of θ, namely, unity.

The sampling experiments are more easily performed than described in
detail. Therefore, let us make a start with θ₁ = 1, θ₂ = 2, θ₃ = 3 and θ₄ = 4.
We imagine that, perhaps within a week, a practical statistician is faced
four times with the problem of estimating θ, each time from twelve
observations, and that the true values of θ are as above although the
statistician does not know this. We imagine further that the statistician is an
elderly gentleman, greatly attached to the arithmetic mean and that he wishes to
use formulae (22). However, the statistician has a young assistant who
may have read (and understood) modern literature and prefers formulae
(21). Thus, for each of the four instances, we shall give two confidence
intervals for θ, one computed by the elderly Boss, the other by his young
Assistant.

Using the first column on the first page of Tippett’s tables of random
numbers and performing the indicated multiplications, we obtain the
following four sets of figures.

TABLE I

                          1st Sample         2nd Sample         3rd Sample         4th Sample
True θ                      θ₁ = 1             θ₂ = 2             θ₃ = 3             θ₄ = 4

x₁                           .295              1.368              2.334               .090
x₂                           .416               .408              2.923              1.204
x₃                           .273              1.113              2.826              2.996
x₄                           .056               .902           [illegible]           2.075
x₅                           .275               .430           [illegible]            .479
x₆                           .587              1.383              1.905              1.563
x₇                           .926              1.648               .424               .438
x₈                           .200              1.583              2.112               .727
x₉                           .956              1.877               .973              2.483
x₁₀                          .824               .687              2.377              1.901
x₁₁                          .566              1.819              2.631               .194
x₁₂                          .101              1.845               .753              1.977

Arithmetic mean x̄          .45625            1.25525            1.65200            1.34392
Greatest observation x       .956              1.877              2.923              2.996

Boss’ conf. interval    .688 ≤ θ ≤ 1.355   1.892 ≤ θ ≤ 3.728   2.490 ≤ θ ≤ 4.907   2.026 ≤ θ ≤ 3.992
Asst.’s conf. interval  .956 ≤ θ ≤ 1.227   1.877 ≤ θ ≤ 2.409   2.923 ≤ θ ≤ 3.752   2.996 ≤ θ ≤ 3.846


The last two lines give the assertions regarding the true value of θ made
by the Boss and by the Assistant, respectively. The purpose of the
sampling experiment is to verify the theoretical result that the long run
relative frequency of cases in which these assertions will be correct is,
approximately, equal to α = .95.
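
The two sets of limits can be recomputed from the tabulated means and greatest observations. Formulae (21) and (22) are not reproduced in this part of the text, so the sketch below assumes their forms: for (21), X ≤ θ ≤ X/(1 − α)^(1/12), and for (22), the limits obtained by treating the mean of twelve observations as normal with expectation θ/2 and standard deviation θ/12, using the deviate 1.96. These assumed forms agree with the tabulated limits to three decimals; the function names are hypothetical.

```python
ALPHA = 0.95   # confidence coefficient
N_OBS = 12     # observations per sample
Z = 1.96       # assumed normal deviate for ALPHA = .95

def assistant_interval(x_max):
    """Assumed form of (21): X <= theta <= X / (1 - ALPHA)**(1/N_OBS)."""
    return x_max, x_max / (1 - ALPHA) ** (1 / N_OBS)

def boss_interval(x_bar):
    """Assumed form of (22): normal approximation to the arithmetic mean."""
    k = Z / (12 * N_OBS) ** 0.5   # Z standard deviations of the mean, per unit theta
    return x_bar / (0.5 + k), x_bar / (0.5 - k)

TRUE_THETA = [1, 2, 3, 4]
MEANS = [0.45625, 1.25525, 1.65200, 1.34392]   # arithmetic means from the table
MAXIMA = [0.956, 1.877, 2.923, 2.996]          # greatest observations from the table

for theta, x_bar, x_max in zip(TRUE_THETA, MEANS, MAXIMA):
    b_lo, b_hi = boss_interval(x_bar)
    a_lo, a_hi = assistant_interval(x_max)
    print(f"theta={theta}: Boss {b_lo:.3f} to {b_hi:.3f} "
          f"({'ok' if b_lo <= theta <= b_hi else 'miss'}), "
          f"Assistant {a_lo:.3f} to {a_hi:.3f} "
          f"({'ok' if a_lo <= theta <= a_hi else 'miss'})")
```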

You will notice that in three out of the four cases considered, both
assertions (the Boss’ and the Assistant’s) regarding the true value of θ are
correct and that in the last case both assertions are wrong. In fact, in this
last case the true θ is 4 while the Boss asserts that it is between 2.026 and
3.992 and the Assistant asserts that it is between 2.996 and 3.846. Although
the probability of success in estimating θ has been fixed at α = .95, the
failure on the fourth trial need not discourage us. In reality, a set of four
trials is plainly too short to serve for an estimate of a long run relative
frequency. Furthermore, a simple calculation shows that the probability
of at least one failure in the course of four independent trials is equal to
1 − (.95)⁴ = .1855. Therefore, a group of four consecutive samples like the
above, with at least one wrong estimate of θ, may be expected one time in six
or even somewhat oftener. The situation is, more or less, similar to betting
on a particular side of a die and seeing it win. However, if you continue the
sampling experiment and count the cases in which the assertion regarding
the true value of θ, made by either method, is correct, you will find that the
relative frequency of such cases converges gradually to its theoretical value,
α = .95.
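
The continued experiment is easily imitated on a computer. The sketch below is an illustration under stated assumptions rather than a reproduction of the hand computation: it uses a pseudo-random generator instead of a table, continuous rather than three-digit observations, and the same assumed forms of (21) and (22) as reconstructed from the tabulated limits (greatest observation with factor (1 − α)^(−1/12), and the normal approximation to the mean with deviate 1.96).

```python
import random

ALPHA = 0.95
N_OBS = 12

def covers_21(sample, theta):
    """Assumed (21): X <= theta <= X / (1 - ALPHA)**(1/N_OBS)."""
    x_max = max(sample)
    return x_max <= theta <= x_max / (1 - ALPHA) ** (1 / N_OBS)

def covers_22(sample, theta, z=1.96):
    """Assumed (22): limits from the normal approximation to the mean."""
    x_bar = sum(sample) / len(sample)
    k = z / (12 * len(sample)) ** 0.5
    return x_bar / (0.5 + k) <= theta <= x_bar / (0.5 - k)

def coverage(n_trials, thetas, rng):
    """Relative frequency of correct assertions for both methods."""
    hits_21 = hits_22 = 0
    for i in range(n_trials):
        theta = thetas[i % len(thetas)]   # true values may vary from trial to trial
        sample = [theta * rng.random() for _ in range(N_OBS)]
        hits_21 += covers_21(sample, theta)
        hits_22 += covers_22(sample, theta)
    return hits_21 / n_trials, hits_22 / n_trials

freq_21, freq_22 = coverage(20000, [1, 2, 3, 4], random.Random(0))
print(freq_21, freq_22)      # both should be close to .95
print(1 - ALPHA ** 4)        # chance of at least one failure in four trials
```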

Let us put this into more precise terms. Suppose you decide on a number
N of samples which you will take and use for estimating the true value
of θ. The true values of the parameter θ may be the same in all N cases
or they may vary from one case to another. This is absolutely immaterial
as far as the relative frequency of successes in estimation is concerned. In
each case the probability that your assertion will be correct is exactly equal
to α = .95. Since the samples are taken in a manner insuring independence
(this, of course, depends on the goodness of the table of random numbers
used), the total number Z(N) of successes in estimating θ is the familiar
binomial variable with expectation equal to Nα and with variance equal
to Nα(1 − α). Thus, if N = 100, α = .95, it is rather improbable that the
relative frequency Z(N)/N of successes in estimating θ will differ from α
by more than

$$2\sqrt{\frac{\alpha(1-\alpha)}{N}}.$$


This is the exact meaning of the colloquial description that the long run
relative frequency of successes in estimating θ is equal to the preassigned α.
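
Since Z(N) is binomial with variance Nα(1 − α), the standard deviation of Z(N)/N is √(α(1 − α)/N). A small sketch evaluating twice this quantity for a few values of N may help to fix ideas; the function name is hypothetical and the choice of two standard deviations as the yardstick of “rather improbable” is taken as an assumption.

```python
from math import sqrt

def two_sigma_bound(alpha, n_samples):
    """Twice the standard deviation of the relative frequency Z(N)/N."""
    return 2 * sqrt(alpha * (1 - alpha) / n_samples)

for n_samples in (100, 1000, 10000):
    bound = two_sigma_bound(0.95, n_samples)
    print(n_samples, round(bound, 4),
          round(0.95 - bound, 4), round(0.95 + bound, 4))
```

For N = 100 the bound is about .044, so that relative frequencies between roughly .91 and .99 would not be surprising.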

Your knowledge of the theory of confidence intervals will not be influ- 
enced by the sampling experiment described, nor will the experiment prove 
anything. However, if you perform it, you will get an intuitive feeling of 
the machinery behind the method which is an excellent complement to the 
understanding of the theory. This is like learning to drive an automobile: 
gaining experience by actually driving a car compared with learning the 
theory by reading a book about driving. 

Among other things, the sampling experiment will attract attention to
the frequent difference in the precision of estimating θ by means of the two
alternative confidence intervals (21) and (22). You will notice, in fact,
that the confidence intervals based on X, the greatest observation in the
sample, are frequently shorter than those based on the arithmetic mean X̄.
If we continue to discuss the sampling experiment in terms of cooperation
between the eminent elderly statistician and his young assistant, we shall
have occasion to visualize quite amusing scenes of indignation on the one
hand and of despair before the impenetrable wall of stiffness of mind and
routine of thought ⁸ on the other. For example, one can imagine the
conversation

8 Sad as it is, your mind does become less flexible and less receptive to novel ideas
as the years go by. The more mature members of the audience should not take offense.
I, myself, am not young and have young assistants. Besides, unreasonable and stubborn

between the two men in connection with the first and third samples
reproduced above. You will notice that in both cases the confidence interval
of the Assistant is not only shorter than that of the Boss but is completely
included in it. Thus, as a result of observing the first sample, the Assistant
asserts that


.956 ≤ θ ≤ 1.227.


On the other hand, the assertion of the Boss is far more conservative and
admits the possibility that θ may be as small as .688 and as large as 1.355.
And both assertions correspond to the same confidence coefficient, α = .95!
I can just see the face of my eminent colleague redden with indignation and 
hear the following colloquy. 


Boss: “Now, how can this be true? I am to assert that θ is between .688 and 1.355
and you tell me that the probability of my being correct is .95. At the same time,
you assert that θ is between .956 and 1.227 and claim the same probability of success in
estimation. We both admit the possibility that θ may be some number between .688
and .956 or between 1.227 and 1.355. Thus, the probability of θ falling within these
intervals is certainly greater than zero. In these circumstances, you have to be a nit-wit
to believe that


P{.688 < θ < 1.355} = P{.688 < θ < .956} + P{.956 < θ < 1.227}
                      + P{1.227 < θ < 1.355}
                    = P{.956 < θ < 1.227}.”


Assistant: “But, Sir, the theory of confidence intervals does not assert anything
about the probability that the unknown parameter θ will fall within any specified limits.
What it does assert is that the probability of success in estimation using either of the
two formulae (21) or (22) is equal to α.”

Boss: “Stuff and nonsense! I use one of the blessed pair of formulae and come up
with the assertion that .688 < θ < 1.355. This assertion is a success only if θ falls within
the limits indicated. Hence, the probability of success is equal to the probability of
θ falling within these limits ...”

Assistant: “No, Sir, it is not. The probability you describe is the a posteriori
probability regarding θ, while we are concerned with something else. Suppose that we
continue with the sampling experiment until we have, say, N = 100 samples. You will see,
Sir, that the relative frequency of successful estimations using formulae (21) will be
about the same as that using formulae (22) and that both will be approximately equal
to .95.”


I do hope that the Assistant will not get fired. However, if he does, I
would remind him of the glory of Giordano Bruno who was burned at the
stake by the Holy Inquisition for believing in the Copernican theory of the
solar system. Furthermore, I would advise him to have a talk with a
physicist or a biologist or, maybe, with an engineer. They might fail to
understand

individuals are found not only among the elderly but also frequently among young
people.

the theory but, if he performs for them the sampling experiment
described above, they are likely to be convinced and give him a new job.
In due course, the eminent statistical Boss will die or retire and then ...

Now, let us forget the Boss and his Assistant and return to the important
problem of the varying length of confidence intervals. By inspecting formulae
(21) and (22), it is easy to see that for certain sample points the confidence
interval (22), based on the arithmetic mean X̄, is shorter than the
corresponding confidence interval (21). When the greatest observation X is
fixed, the arithmetic mean X̄ of twelve positive observations may be
arbitrarily close to X/12. Thus, the length of the corresponding confidence
interval (22) may approach the lower bound of

$$\frac{(1.462)\,X}{12} = (.122)\,X < (.284)\,X,$$

the quantity (.284) X being the length of the confidence interval (21).

This circumstance makes it intuitively clear how it happens that the use
of either formulae (21) or (22) insures the same frequency of successes.
However, if you perform the sampling experiment described above, you
will notice that in the great majority of cases the confidence intervals
computed from (21) are substantially shorter than those computed from (22).
This empirical result suggests that, from the point of view of precision in
estimating θ, formulae (21) are preferable to formulae (22). However,
it is possible that some third pair of confidence limits can be invented,
corresponding to the same confidence coefficient, which will give still better
precision in estimating θ.

We are brought, thus, to the problem of a choice among the various
possible confidence intervals and you will appreciate that this problem is of
considerable practical importance and of great theoretical interest. Our
first difficulty in attacking the problem consists in formulating it so that
it has a definite mathematical meaning. In the case of known a priori
distributions, the situation was simple because the Bayes’ estimating interval
corresponding to any given sample point is selected independently from
those corresponding to other possible sample points. Therefore, we could
simply seek that estimating interval which is shortest. With confidence
intervals the situation is different because, instead of dealing directly with
confidence intervals corresponding to particular sample points, we deal with
regions of acceptance and the confidence interval corresponding to any given
sample point depends upon the way in which the regions of acceptance are
piled above this point. By shifting the regions of acceptance, it is possible
to reduce to a minimum the length of the confidence interval corresponding
to a specified sample point. However, it is intuitively clear that by so
doing we shall increase the length of a great many other confidence intervals
which correspond to different sample points. Thus, it appears that the
problem of “optimum” must concern not the length of particular confidence
intervals taken separately, but the totality of these intervals.
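
The remark that intervals (21) are usually, but not always, the shorter ones can be checked by a short simulation. The sketch below rests on assumptions: the forms of (21) and (22) are those inferred from the numerical limits quoted in the text (lengths of about .284 X and 1.462 X̄ respectively), and a pseudo-random generator replaces the table of random numbers.

```python
import random

ALPHA, N_OBS, Z = 0.95, 12, 1.96

def length_21(sample):
    """Length of the assumed interval (21), about .284 times the maximum."""
    x_max = max(sample)
    return x_max * ((1 - ALPHA) ** (-1 / N_OBS) - 1)

def length_22(sample):
    """Length of the assumed interval (22), about 1.462 times the mean."""
    x_bar = sum(sample) / len(sample)
    k = Z / (12 * len(sample)) ** 0.5
    return x_bar / (0.5 - k) - x_bar / (0.5 + k)

def compare(n_trials, rng, theta=1.0):
    """Share of samples where (21) is shorter, and the two average lengths."""
    shorter = 0
    total_21 = total_22 = 0.0
    for _ in range(n_trials):
        sample = [theta * rng.random() for _ in range(N_OBS)]
        l21, l22 = length_21(sample), length_22(sample)
        shorter += l21 < l22
        total_21 += l21
        total_22 += l22
    return shorter / n_trials, total_21 / n_trials, total_22 / n_trials

share, mean_21, mean_22 = compare(20000, random.Random(0))
print(share, mean_21, mean_22)

# An extreme sample point: one observation at 1, the others nearly zero.
extreme = [1.0] + [1e-9] * 11
print(length_21(extreme), length_22(extreme))
```

The extreme sample point shows the exceptional case in which interval (22) is the shorter of the two.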


Figure 2

The problem I formulated in my paper of 1937 is based on the following
considerations. The desirable property of a confidence interval is that it
covers the true value of the estimated parameter θ with the preassigned
frequency α. In so doing, the confidence interval also covers an infinity of
“false” values of θ. This, however, is a nuisance and should occur as rarely
as possible. When one starts from this point of view, it is easy to give an
exact definition of the shortest confidence intervals.

Definition. The confidence interval δ(E) for estimating the parameter θ,
corresponding to the confidence coefficient α, is called the shortest if,
whatever be the alternative confidence interval Δ(E) corresponding to the same
confidence coefficient α and whatever be two possible values θ₁ and θ₂ of
the estimated parameter θ,

P{δ(E) C θ₁ | θ₂} ≤ P{Δ(E) C θ₁ | θ₂}.


Operationally, this means that, whatever be the true value θ₂ and whatever
be the “false” value θ₁, this false value will be covered by δ(E) not more
frequently than it will be covered by Δ(E).

It can be shown that the confidence interval (21) is the shortest in the
sense of this definition. As you see, the “shortness” of the confidence
interval δ(E) considered as a random interval used for purposes of estimation
is consistent with the fact that for some particular sample points the length
of interval (21) exceeds that of interval (22).
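
The operational meaning of the definition can be illustrated by counting how often each interval covers a specified “false” value. The sketch below is illustrative only: it takes the true value to be 1, tries the false values .9 and 1.2, and again assumes the forms of (21) and (22) inferred from the numerical limits quoted earlier. Interval (22) is only an approximate confidence interval, so the comparison is indicative and not a proof.

```python
import random

ALPHA, N_OBS, Z = 0.95, 12, 1.96
FACTOR_21 = (1 - ALPHA) ** (-1 / N_OBS)   # about 1.284
K_22 = Z / (12 * N_OBS) ** 0.5            # Z standard deviations of the mean, per unit theta

def covers_21(sample, value):
    x_max = max(sample)
    return x_max <= value <= FACTOR_21 * x_max

def covers_22(sample, value):
    x_bar = sum(sample) / len(sample)
    return x_bar / (0.5 + K_22) <= value <= x_bar / (0.5 - K_22)

def coverage_of(value, theta_true, n_trials, rng):
    """Relative frequency with which each interval covers `value`."""
    hits_21 = hits_22 = 0
    for _ in range(n_trials):
        sample = [theta_true * rng.random() for _ in range(N_OBS)]
        hits_21 += covers_21(sample, value)
        hits_22 += covers_22(sample, value)
    return hits_21 / n_trials, hits_22 / n_trials

rng = random.Random(0)
results = {v: coverage_of(v, 1.0, 20000, rng) for v in (0.9, 1.2)}
for false_value, (f21, f22) in results.items():
    print(false_value, f21, f22)
```

In both cases the false value should be covered by interval (21) markedly less often than by interval (22).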

Unfortunately, in a great many cases of practical importance the shortest 
confidence intervals do not exist and we are forced to look for other possi- 
bilities. The study of these questions is an interesting and important part 
of the theory of estimation. However, its discussion would lead us far 
afield and all that I can do here is to refer you to my papers of 1937 and 
1938 already quoted. At the present moment we will return to the problem 
of interpretation of confidence intervals and of Bayes’ estimating intervals. 
Needless to say, a clear understanding of the difference between these two 
approaches to the problem is of fundamental and immediate importance to 
everyone concerned with estimation. 

Figures 2 and 3 refer to the example of estimating θ, the only parameter
involved in the probability density function (8) of n = 12 independent
variables X₁, X₂, …, Xₙ, all uniformly distributed between zero and θ > 0.
The two figures show confidence intervals (21). Thus the quantity measured
on the axis of abscissae is X, the greatest of the twelve observations.
The quantity measured on the axis of ordinates is θ. The heavy diagonal
line has the equation θ̲(X) = X and represents the lower confidence limit
for θ. The dashed straight line above represents the upper confidence limit


θ̄(X) = (1.284) X.


Thus whatever be the observed value of X, the corresponding confidence
interval for θ can be read directly from either of the two figures. The
confidence intervals would be used when nothing is known about the a priori
distribution of θ and the assertions regarding the true value of θ obtained
from the graphs will be true with a long run relative frequency equal to .95.
Figures 2 and 3 also display the classical Bayes’ estimating interval in
addition to the confidence intervals. On both graphs the Bayes’ estimating
intervals correspond to an a priori distribution of θ of the same form,

$$p(\theta) = \begin{cases} m\,\theta^{m-1} & \text{for } 0 < \theta < 1, \\ 0 & \text{elsewhere.} \end{cases} \tag{23}$$

However, in Figure 2 the value of m is m = 12.1 while in Figure 3, it is
m = 4.5. In both cases, the lower end of the classical Bayes’ estimating
interval coincides with the lower confidence limit, θ̲(X) = X. On the other
hand, the upper end of the classical Bayes’ estimating interval is given by,
say,

$$\theta_B(X) = \left[X^{m-n}(1-\alpha) + \alpha\right]^{1/(m-n)},$$

where α = .95 is the chosen confidence coefficient. The estimation of θ
may consist in observing X and in asserting that θ lies within the
corresponding Bayes’ estimating interval. The long run relative frequency of
successes will again be equal to α.

Upon inspecting the two figures, you are likely to have a feeling of
surprise. Figure 2, especially, is striking because the Bayes’ estimating
intervals are so much wider than the corresponding confidence intervals over a
very wide range of values of X. And yet, the Bayes’ intervals are the
shortest possible adjusted to the a priori distribution of θ of the particular
kind indicated while the confidence intervals which assure the same long
run frequency of successes in estimation will give this frequency quite
irrespective of whether the a priori distribution is that visualized in Figure 2
or any other. Only if X > .77 (approximately) are the confidence intervals
wider than the Bayes’ intervals and then the difference in width is much
milder than that, in favor of confidence intervals, for X < .77.
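
The two upper limits can be compared numerically. The sketch below assumes that the upper end of the Bayes’ interval is [X^(m−n)(1 − α) + α]^(1/(m−n)) with n = 12, and that the upper confidence limit is X/(1 − α)^(1/12), about (1.284) X. The function names and the bisection routine are hypothetical aids, not part of the original treatment.

```python
def bayes_upper(x, m, n=12, alpha=0.95):
    """Upper end of the Bayes' estimating interval (assumed form)."""
    d = m - n
    return (x ** d * (1 - alpha) + alpha) ** (1 / d)

def confidence_upper(x, n=12, alpha=0.95):
    """Upper confidence limit, about 1.284 times the greatest observation."""
    return x / (1 - alpha) ** (1 / n)

def crossover(m, lo=0.01, hi=0.99, steps=60):
    """Bisection for the x at which both intervals have the same width."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        # Both intervals start at x, so comparing upper ends compares widths.
        if confidence_upper(mid) < bayes_upper(mid, m):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(bayes_upper(0.5, 12.1), 3))      # Bayes' upper end for x = .500, Figure 2
print(round(confidence_upper(0.5), 3))       # upper confidence limit for x = .500
print(round(crossover(12.1), 3))             # close to .77 for Figure 2
```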

The answer is that, owing to the special form of the a priori distribution,
larger values of θ will occur much more frequently than smaller ones and
this is reflected also in the absolute distribution of X. This distribution is
easily obtained as follows. We begin by writing the joint distribution of
θ and X,

$$p(\theta, x) = \begin{cases} m\,n\,x^{n-1}\,\theta^{m-n-1} & \text{for } 0 < x \le 1 \text{ and } x \le \theta \le 1, \\ 0 & \text{elsewhere.} \end{cases}$$

The absolute distribution of X is obtained by integrating this expression
with respect to θ from x to 1,

$$p(x) = \begin{cases} \dfrac{mn}{m-n}\left(x^{n-1} - x^{m-1}\right) & \text{for } 0 < x < 1, \\ 0 & \text{elsewhere.} \end{cases} \tag{24}$$

It is easy to verify that, if m and n exceed unity, this probability density
vanishes at x = 0 and x = 1 and has its maximum at

$$x = \left(\frac{n-1}{m-1}\right)^{1/(m-n)}.$$

If m = 12.1 and n = 12, then this value exceeds .9. Thus in the situation
represented in Figure 2, the most frequent values of X are close to unity
and, consequently, the confidence intervals most frequently used will exceed
the corresponding Bayes’ shortest estimating intervals. A similar situation
corresponds to Figure 3.
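
The location of the maximum can be checked numerically. The sketch below assumes that the a priori density is mθ^(m−1) on (0, 1) and that, given θ, the greatest of n = 12 observations has density n x^(n−1)/θⁿ; integrating over θ then gives the absolute density mn(x^(n−1) − x^(m−1))/(m − n) used in the code. The function names are hypothetical.

```python
def marginal_density(x, m, n=12):
    """Assumed absolute density of the greatest observation X."""
    if not 0 < x < 1:
        return 0.0
    return m * n / (m - n) * (x ** (n - 1) - x ** (m - 1))

def mode(m, n=12):
    """Point at which the density above attains its maximum."""
    return ((n - 1) / (m - 1)) ** (1 / (m - n))

steps = 10000
grid = [(i + 0.5) / steps for i in range(steps)]
for m in (12.1, 4.5):
    total = sum(marginal_density(x, m) for x in grid) / steps   # midpoint rule
    best = max(grid, key=lambda x: marginal_density(x, m))
    print(m, round(mode(m), 4), round(best, 4), round(total, 4))
```

For m = 12.1 the maximum falls near .913, and for m = 4.5 near .86, so in both figures the most frequent values of X lie well above .77.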

This reasoning explains only one aspect of the situation. To understand 
fully the other aspect we have to visualize the interpretation of the Bayes’ 
estimating intervals in terms of frequencies. This may be done with refer- 
ence to general human experience or, in order to speak in more concrete 
terms, with reference to a sampling experiment appropriately arranged so 
as to satisfy the hypotheses underlying Figure 2 and Figure 3. 

It is essential that one makes the point clear that a sampling experiment
which might illustrate all the properties of Bayes’ estimating intervals is
much more complicated than the one discussed above whose purpose was
to illustrate the working of confidence intervals. In dealing with confidence
intervals we were at liberty to select an arbitrary set of positive numbers
and to consider these numbers as the true values of θ. Then it was an easy
matter to use the tables of random numbers in order to obtain a sample of
twelve observations following the distribution (8). Now we have to begin,
as it were, earlier and create a machinery for obtaining a set of consecutive
values of θ following the a priori distribution appropriate to Figure 2 and/or
Figure 3. No arbitrary selection of the true θ’s is allowed.

After this point has been settled in one way or another (those who are
familiar with the arranging of sampling experiments will have no difficulty
with this step), we proceed to obtain sampled values of X₁, X₂, …, X₁₂,
corresponding to each value of θ already determined. As in the case of
confidence intervals (21), we shall be interested not in all twelve sample
values but only in the greatest of them, X. The frequency distribution of
this variable will correspond to the probability density (24). Now suppose
that the first sample ascribes to X the value, x = .500. The Bayes’ estimating
interval corresponding to this value, as read from Figure 2, extends from
.500 to .967, approximately. In order to interpret this interval (.500, .967)
in detail, we would have to continue the sampling experiment for quite some
time until we observed x = .500 another time, then still another time, etc.
In short, for the interpretation of the Bayes’ interval (.500, .967), we need
a long sequence of outcomes of the sampling experiment in which the
greatest of the twelve observations is equal to .500. It is obvious that the
actual performance of this experiment is impractical unless it is performed
using the most modern high speed computing machines. However, this
should not preclude us from discussing it.

Imagine that, after we have repeated the sampling experiment a few
million times, each time first determining a fresh value of θ in accordance
with probability density function (23) and then getting the twelve values
of the X’s, we have finally selected a set, say S(.5), of some 100 cases in
which the value of X was exactly equal to .500. This set, S(.5), is basic
for the interpretation of the Bayes’ interval (.500, .967). Naturally, the
values of θ corresponding to all experiments in S(.5) will be different in
general and we shall visualize the distribution of these values. In accordance
with what was explained in yesterday’s conference, this distribution
will correspond approximately to the a posteriori probability density
function,

$$p(\theta \mid x) = \begin{cases} \dfrac{m-n}{1 - x^{m-n}}\,\theta^{m-n-1} & \text{for } x \le \theta \le 1, \\ 0 & \text{elsewhere,} \end{cases}$$

with x = .500 and the interval (.500, .967) will be found the shortest of all
those which include 95 per cent of the values of θ.

This is the precise interpretation of the Bayes’ shortest estimating
intervals. If you compare the foregoing with our previous discussion, you will
see that confidence intervals do not have the property just described. In
fact, if we take under consideration any particular confidence interval, e.g.
the confidence interval (.500, .642) corresponding to the same value of
x = .500, the relative frequency of experiments forming the set S(.5) in
which .500 ≤ θ ≤ .642 will depend on the a priori distribution of θ and, in
general, will not be equal to .95. On the other hand, whatever be the
a priori distribution of θ, the assertion regarding the value of θ in any
particular case, based on the confidence interval, has the probability equal
to α = .95 of being correct.
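
With a modern machine the experiment just described is no longer impractical. The sketch below is an illustration under several assumptions: the a priori density is taken as mθ^(m−1) with m = 12.1 (Figure 2), values of θ are drawn by inverting its distribution function, the greatest observation is drawn directly instead of generating all twelve values, and the set S(.5) is replaced by the cases with X in the narrow band from .49 to .51, since an exact value of .500 practically never recurs with continuous variables. Within the band, each case is judged against the intervals belonging to its own value of X.

```python
import random

ALPHA, N_OBS, M = 0.95, 12, 12.1
FACTOR_21 = (1 - ALPHA) ** (-1 / N_OBS)   # about 1.284

def draw_case(rng):
    """One round: theta from the assumed prior, then the greatest of 12 values."""
    theta = rng.random() ** (1 / M)               # inverse of the CDF theta**M
    x_max = theta * rng.random() ** (1 / N_OBS)   # maximum of 12 uniforms on (0, theta)
    return theta, x_max

def bayes_upper(x):
    d = M - N_OBS
    return (x ** d * (1 - ALPHA) + ALPHA) ** (1 / d)

def experiment(n_rounds, rng, lo=0.49, hi=0.51):
    """Frequencies of success inside the band and over all rounds."""
    in_band = bayes_hits = conf_hits_band = conf_hits_all = 0
    for _ in range(n_rounds):
        theta, x = draw_case(rng)
        conf_ok = theta <= FACTOR_21 * x          # theta >= x holds by construction
        conf_hits_all += conf_ok
        if lo <= x <= hi:
            in_band += 1
            bayes_hits += theta <= bayes_upper(x)
            conf_hits_band += conf_ok
    if in_band == 0:
        return 0, None, None, conf_hits_all / n_rounds
    return (in_band, bayes_hits / in_band,
            conf_hits_band / in_band, conf_hits_all / n_rounds)

n_band, bayes_freq, conf_freq_band, conf_freq_all = experiment(
    500_000, random.Random(0))
print(n_band, bayes_freq, conf_freq_band, conf_freq_all)
```

Under these assumptions the Bayes’ intervals should succeed in about 95 per cent of the selected cases and the confidence intervals in only about 35 per cent of them, while over all rounds taken together the confidence intervals should again succeed in about 95 per cent.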

Before concluding, we shall make a very brief review of early papers of 
several authors in which one can discern the germs of the theory of con- 
fidence intervals. 

The idea of estimation by confidence intervals and by confidence regions
is very clearly and faultlessly stated in the last few sentences of a paper by
Hotelling ⁹ published in 1931. However, the statement of this idea was not
followed by an attempt to develop a systematic theory. The relevant pas- 
sage is very brief and its brevity must have contributed to its being over- 
looked by many readers including myself. In order to give full credit to 
Hotelling, I wish to reproduce the passage verbatim. 


To means of a single variate it is customary to attach a “probable error,” with the 
assumption that the difference between the true and calculated values is almost cer- 
tainly less than a certain multiple of the probable error. A more precise way to follow 


9 Harold Hotelling: “The generalization of Student’s ratio.” Annals of Math. Stat.,
Vol. 2 (1931), pp. 360-378.

out this assumption would be to adopt some definite level of probability, say P = .05,
of a greater discrepancy, and to determine from a table of Student’s distribution the
corresponding value of t, which will depend on n; adding and subtracting the product
of this value of t by the estimated standard error would give upper and lower limits
between which the true values may with the given degree of confidence be said to lie.
With T an exactly analogous procedure may be followed, resulting in the determination
of an ellipse or ellipsoid centered at the point ξ₁, ξ₂, …, ξₚ. Confidence corresponding
to the adopted probability P may then be placed in the proposition that the set of true
values is represented by a point within this boundary. (Harold Hotelling: Annals of
Math. Stat., Vol. 2, pp. 377-378.)


Next in turn, in the reverse chronological order, it would be necessary to 
refer to papers by R. A. Fisher concerned with the so-called “fiducial argu- 
ment.” The early papers of Fisher given to this subject definitely suggest 
the idea of confidence intervals. Later on, however, there appeared to be 
a substantial difference between the two theories. The relevant literature 
is extensively discussed in the next part of the present chapter. 

Before either Hotelling or Fisher, the idea of confidence intervals is found
in papers by E. B. Wilson ¹⁰ and Stanislas Millot.¹¹ Both authors are
concerned with estimating the probability p of success postulated to be
constant in n completely independent trials in which the success occurred
exactly X times. Working independently, the two authors used similar
arguments to deduce the approximate confidence interval for p, based on
the assumption that the distribution of the standardized binomial variable,
say

$$\frac{X - np}{\sqrt{np(1-p)}},$$

is approximately normal. However, the conceptual background of the two
papers is essentially different from the statement of the problem of
confidence intervals and is limited to the view that it is reasonable to use the
formula deduced. Wilson explains this clearly in his more recent paper ¹²
given to the same problem. In addition, the paper of Millot involves obvious
misunderstandings which it may be useful to discuss. For this purpose, we
shall deduce the formulae for the (approximate) lower and upper confidence
limits for p.

We begin by postulating that Y = X/n is a normal variable with expec- 
tation p and variance p(1 — p)/n, where p stands for the unknown true 


10 E. B. Wilson: “Probable inference, the law of succession, and statistical inference.”
Jr. Amer. Stat. Assoc., Vol. 22 (1927), pp. 209-212.

11 Stanislas Millot: “Sur la probabilité a posteriori.” Comptes Rendus, Paris
Academy, t. 176 (1923), pp. 30-32.

12 E. B. Wilson: “On confidence intervals.” Proc. Nat. Acad. Sc., Vol. 28 (1942),
pp. 88-93.

probability of success and may be any number 0 < p < 1. Following the
steps indicated in the earlier part of this conference we proceed to construct
a system A of regions of acceptance, say A(p). Obviously, the sample
space of the variable Y is limited to the interval 0 ≤ Y ≤ 1. Fix any
possible value p′ of p, and determine in W a region A(p′) satisfying condition
(6). Considerations of simplicity suggest that the region A(p′) be
represented by an interval, say from a(p′) to b(p′) with


0 ≤ a(p′) ≤ b(p′) ≤ 1.

Then condition (6) implies that, if p′ happens to be the true value of p,

P{a(p′) ≤ Y ≤ b(p′) | p = p′} = α,

where α is the adopted confidence coefficient. Using the postulate that Y
is a normal variable, we can rewrite this condition as

$$\frac{\sqrt{n}}{\sqrt{2\pi p'(1-p')}} \int_{a(p')}^{b(p')} e^{-n(y-p')^2/2p'(1-p')}\,dy = \alpha. \tag{25}$$

Obviously, equation (25) does not determine a(p′) and b(p′) uniquely. In
fact, it is possible to select a(p′) arbitrarily, provided the value selected
is not too large, and then to determine b(p′) to satisfy condition (25).
Considerations of simplicity suggest that a(p′) and b(p′) be symmetrically
placed about p′ so that


a(p′) = p′ − Δ,
b(p′) = p′ + Δ.

Now, if λ satisfies the condition

$$\frac{1}{\sqrt{2\pi}} \int_{-\lambda}^{\lambda} e^{-t^2/2}\,dt = \alpha,$$

then

$$\Delta = \lambda\sqrt{\frac{p'(1-p')}{n}}$$

and the region of acceptance A(p′) is defined by the double relation,

$$p' - \lambda\sqrt{\frac{p'(1-p')}{n}} \le Y \le p' + \lambda\sqrt{\frac{p'(1-p')}{n}}.$$

We shall adopt this definition of A(p′) for every possible value p′ of p
and test whether or not the set A of all such regions satisfies conditions (I)
and (II). For this purpose we fix an arbitrary possible value y of Y and
look for the values p for which y ∈ A(p). The search reduces to the solution
with respect to p′ of the two inequalities defining region A(p′). If we drop
the primes and perform easy algebra, we find

$$|Y - p| \le \lambda\sqrt{\frac{p(1-p)}{n}},$$

$$Y^2 - 2pY + p^2 \le \frac{\lambda^2\,p(1-p)}{n},$$

or

$$p^2\left(1 + \frac{\lambda^2}{n}\right) - 2p\left(Y + \frac{\lambda^2}{2n}\right) + Y^2 \le 0. \tag{26}$$

Since the coefficient of p² is positive, the left hand side of inequality (26)
is negative only if the two roots of the quadratic are real and the value of p
is contained between the smaller and the larger of these roots. Denoting
the roots by p₁(Y) and p₂(Y), we have

\[ p_{1,2}(Y) = \frac{Y + \dfrac{\lambda^2}{2n} \mp \lambda \sqrt{\dfrac{Y(1-Y)}{n} + \dfrac{\lambda^2}{4n^2}}}{1 + \dfrac{\lambda^2}{n}}, \]

and it is seen that they are always real. Thus, the set of values of p for
which Y ∈ A(p) extends over the closed interval,

\[ p_1(Y) \le p \le p_2(Y). \]
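
As a numerical illustration, the two roots of the quadratic (26) can be
evaluated directly. The following is a minimal sketch in Python and is not
part of the original lecture: the function name `confidence_limits` and its
arguments are hypothetical, and λ is taken from the standard normal quantile
function so that the normal integral between −λ and λ equals α.

```python
from math import sqrt
from statistics import NormalDist


def confidence_limits(y, n, alpha=0.95):
    """Roots p1(Y) <= p2(Y) of the quadratic (26) for an observed frequency y.

    y     -- observed relative frequency of successes, 0 <= y <= 1
    n     -- number of trials (assumed large enough for the normal approximation)
    alpha -- adopted confidence coefficient
    """
    # lambda such that the standard normal mass between -lambda and lambda is alpha
    lam = NormalDist().inv_cdf((1 + alpha) / 2)
    k = lam * lam / n                      # lambda^2 / n
    centre = (y + k / 2) / (1 + k)
    half_width = lam * sqrt(y * (1 - y) / n + k / (4 * n)) / (1 + k)
    return centre - half_width, centre + half_width


# Illustrative values only: 30 successes observed in 100 trials.
p1, p2 = confidence_limits(0.30, 100, alpha=0.95)
print(p1, p2)
```

Because the quantity under the square root is never negative for 0 ≤ y ≤ 1,
the sketch also makes visible why the roots are always real.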


Consequently, the regions A(p) are regions of acceptance and p₁(Y) and
p₂(Y) are a pair of confidence¹³ limits for p, corresponding to the confidence
coefficient α. Thus, if we use the formulae for p₁(Y) and p₂(Y) to make
assertions regarding the true value of p in the form p₁(Y) ≤ p ≤ p₂(Y),
the probability of this assertion being true is (approximately) equal to α.

¹³ Incidentally, a closer analysis shows that these limits possess the defect of being
“biased.” While covering the “true” value with the prescribed relative frequency α, the
confidence interval [p₁(Y), p₂(Y)] covers certain “false” values of p even more
frequently. This fact is due to the adopted symmetry of regions of acceptance. By
dropping the requirement of symmetry, it is possible to obtain somewhat “shorter”
confidence intervals corresponding to the same confidence coefficient and covering the
false values of p less frequently than the true value. However, this advantage of the
unbiased confidence intervals is, in this case, not very important. When n is large, then
the difference between the unbiased intervals and the ones deduced here is insignificant.
On the other hand, when n is small, then the normal approximation which we used here
is inadequate.

In connection with confidence intervals for the binomial p, I should bring
to your attention the fact that Clopper and Pearson¹⁴ have produced
convenient graphs from which these intervals can be read directly. I should
also like to note that, if one does not assume n large enough for the validity
of the normal approximation, then one has to deal with an extended notion
of confidence intervals in which the probability of the true value of the
parameter being covered is at least equal to (instead of equal or
approximately equal to) the chosen confidence coefficient. The method of
constructing such intervals is discussed and illustrated in the joint
publication¹⁵ of Matuszewski, Supinska and myself.

¹⁴ C. J. Clopper and E. S. Pearson: “The use of confidence or fiducial limits illustrated
in the case of the binomial.” Biometrika, Vol. 26 (1934), pp. 405-413.

¹⁵ T. Matuszewski, J. Neyman and J. Supinska: “Statistical studies in questions of
Bacteriology. Part I. The accuracy of the ‘Dilution Method.’” Supplement, Jr. Roy.
Stat. Soc., Vol. 2 (1935), pp. 63-82.

As mentioned, the formulae for p₁(Y) and p₂(Y) were deduced both by
E. B. Wilson and by Stanislas Millot. However, Millot interprets them as
a result relating to the probability a posteriori, with which, in reality, these
formulae have no connection whatever. Moreover, the following passage
translated from Millot’s note indicates that his ideas regarding the
operational properties of the interval [p₁(Y), p₂(Y)] were in disaccord with
the basic concepts of confidence intervals.

Millot writes:


It is useful to record the results of the various experiments made, because, frequently
but to a variable extent, the study of partial series of such experiments may allow us
to reduce the uncertainty regarding the true value of the probability p. To each partial
series, as well as to the total series of observations there corresponds an interval for p,
the boundaries of which are determined from formulae (5). Evidently, the probability
p is contained in the common part of all the intervals thus obtained.


The statement “Evidently, the probability p is contained in the common
part of all the intervals thus obtained,” has no probabilistic meaning
and, therefore, has no room within the theory of confidence intervals. If
we admit the possibility of a lapsus linguae and try to reword this statement
in conformity with the concepts of confidence intervals, we would obtain
something like this: “The probability that the common part of all intervals
thus obtained will bracket the true value of p is even greater than the
confidence coefficient α.” However, even this interpretation does not bring
the idea of Millot within the framework of confidence intervals. In the
latter, the probability of bracketing the true value of the parameter applies
to a completely specified rule, i.e., to a pair of functions, such as p₁(Y) and
p₂(Y), defined for all possible values of the observable random variables.
As I have shown this morning, this probability coincides with the probability
of the sample point falling within the region of acceptance corresponding
to the true value of the parameter estimated. Also, I have shown
that the regions of acceptance are uniquely determined by the confidence
limits. Now, while implying what should be our assertion regarding p
when the several confidence intervals overlap, Millot does not say a word
about this assertion when the confidence intervals fail to overlap. Thus,
Millot’s estimating intervals are not defined for all combinations of values
of the observable random variables and, therefore, the regions of acceptance
are not defined. As a result, without further specification of the estimation
procedure contemplated, it is impossible to assert anything about the
probability that it will lead to a correct assertion.

In addition to the note discussed, Millot has published a few more notes 
in the same volume of Comptes Rendus. However, the general idea behind 
these notes diverges more and more from the basic concept of confidence 
intervals expressed at the beginning of the first note. 

If we go further back, we can trace the idea of confidence intervals, very 
vaguely expressed, in the writings of “Student.” Also, although no explicit 
statement has been found, it is possible that the idea of confidence intervals 
may have been behind the publications of Markoff and even of Gauss, 
concerned with what is now called “best unbiased estimates.” 

This brings us to the question of how the use of confidence intervals can 
be considered a justification for the use of best unbiased estimates and of 
maximum likelihood estimates (when they are consistent and efficient). 
Actually, the argument is in favor of a broader category of estimates, 
having the property that they are asymptotically normal about the true 
value of the parameter with minimum asymptotic variance. This point of 
view was brought out in my paper of 1934 already quoted. 

Let θ be a parameter to be estimated using a large number n of observable
random variables, the totality of which will be denoted by a single letter Xₙ.
Let, further, Fₙ(Xₙ) denote a function of Xₙ and σₙ(θ) a function of n and
θ but not of Xₙ, having the property that, as n → ∞, for all λ > 0,

\[ \lim_{n \to \infty} P\left\{ \frac{|F_n(X_n) - \theta|}{\sigma_n(\theta)} \le \lambda \right\} = \frac{1}{\sqrt{2\pi}} \int_{-\lambda}^{\lambda} e^{-x^2/2}\,dx. \]

By selecting an appropriate λ, the integral in the right hand side may be
made equal to the chosen confidence coefficient α. Hence, if n is sufficiently
large, the probability in the left hand side of this equation will differ but
very little from α. But this probability coincides with the probability,

\[ P\{F_n(X_n) - \lambda\sigma_n(\theta) \le \theta \le F_n(X_n) + \lambda\sigma_n(\theta)\}, \]

which indicates that the two limits, say

\[ \underline{\theta} = F_n(X_n) - \lambda\sigma_n(\theta) \quad \text{and} \quad \bar{\theta} = F_n(X_n) + \lambda\sigma_n(\theta), \]

have the properties of an (approximate) confidence interval corresponding
to the confidence coefficient α. It is true that, without the knowledge of θ,
these limits may be impossible to compute. The important fact, however,
is that the length of the interval indicated is 2λσₙ(θ) and thus is a fixed
multiple of σₙ(θ). Thus, if a number of functions like Fₙ(Xₙ) are available
for estimating θ, the tendency towards the greatest precision in estimation
as measured by the length of the approximate confidence interval
implies a preference for those estimates for which the asymptotic variance
σₙ²(θ) is smallest. This, then, is one of the rational justifications, which
may be brought forward, for the use of best unbiased and maximum likelihood
estimates in the (frequent) cases in which they are asymptotically
normal and efficient. It will be noticed, however, that this justification has
nothing to do with any sort of principle or axiom but is based on purely
utilitarian considerations of consequences of repeated application of the
procedure described.

It may be worthwhile to emphasize that the justification for the use of
best unbiased estimates explicitly stated by Gauss is a different one. As
Laplace had already noticed, the process of estimating an unknown parameter
θ may be compared with a game of chance in which a statistician, using
an estimate Fₙ(Xₙ), may lose a positive quantity [when Fₙ(Xₙ) ≠ θ] or
may break even [when Fₙ(Xₙ) = θ], but in which he can never gain. The
quantity lost is, therefore, a monotone increasing function of the absolute
value of the difference |Fₙ(Xₙ) − θ|, the nature of which, however, cannot
be deduced from the general circumstances of the problem of estimation.
Thus, this loss function, say L[Fₙ(Xₙ) − θ], may be selected arbitrarily
in conformity with each particular problem of estimation. Once the loss
function is selected, the goodness of any particular estimate Fₙ(Xₙ) may
be measured by the expectation, say “risk,”

\[ R[F_n(X_n), \theta] = E\{ L[F_n(X_n) - \theta] \mid \theta \}, \]

of the loss which will be incurred when Fₙ(Xₙ) is used as an estimate of θ.
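
The risk just defined can be approximated numerically for any chosen loss
function. The sketch below is an illustration only and is not taken from the
lecture: it assumes normally distributed observations with unit variance, uses
hypothetical names (`estimated_risk`, `absolute_loss`, `squared_loss`), and
replaces the exact expectation by an average over repeated simulated samples.
It compares the sample mean and the sample median under a loss proportional to
the absolute error and under a loss proportional to the squared error.

```python
import random
from statistics import fmean, median


def absolute_loss(error):
    return abs(error)


def squared_loss(error):
    return error * error


def estimated_risk(estimator, loss, theta, n, repetitions, rng):
    """Average loss of `estimator` over simulated samples of size n.

    Assumption for illustration: observations are normal with mean theta
    and unit variance. The average approximates E{L[F_n(X_n) - theta] | theta}.
    """
    total = 0.0
    for _ in range(repetitions):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += loss(estimator(sample) - theta)
    return total / repetitions


rng = random.Random(0)          # fixed seed so the run is reproducible
theta, n, reps = 2.0, 25, 5000  # illustrative values only

risk_mean_sq = estimated_risk(fmean, squared_loss, theta, n, reps, rng)
risk_median_sq = estimated_risk(median, squared_loss, theta, n, reps, rng)
risk_mean_abs = estimated_risk(fmean, absolute_loss, theta, n, reps, rng)
risk_median_abs = estimated_risk(median, absolute_loss, theta, n, reps, rng)

print("squared loss :", risk_mean_sq, risk_median_sq)
print("absolute loss:", risk_mean_abs, risk_median_abs)
```

Under these assumptions the squared-error risk of the sample mean should come
out close to 1/n, and the sample median should show the larger risk under both
losses.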

Laplace himself studied certain problems on the assumption that the 
loss due to an error in estimation is directly proportional to the absolute 
value of the error. On the other hand, Gauss noticed that various results
became more elegant if the loss is assumed to be proportional to the square
of the error committed so that

\[ L[F_n(X_n) - \theta] \propto [F_n(X_n) - \theta]^2. \]

Upon reflecting on the general nature of errors of measurements, in
particular, on the possibility of systematic errors, Gauss found it necessary to
impose on the estimate Fₙ(Xₙ) another condition, that of unbiasedness,
expressed by identity,

\[ E[F_n(X_n)] \equiv \theta. \]

It will be seen that the two conditions, one of the unbiasedness of Fₙ(Xₙ)
and the other of minimum expected loss measured by the square of the
error, formulate the now familiar problem of best unbiased estimates. All
this was reported to the Königliche Societät der Wissenschaften in Göttingen
on February 15, 1821, and subsequently published in Latin. A German
translation by A. Börsch and P. Simon appeared in a book under the
general title, Abhandlungen zur Methode der kleinsten Quadrate von Carl
Friedrich Gauss, Berlin, 1887, pp. v + 208. I enter into these bibliographical
details partly in an attempt to correct a confusion to which I unwittingly
contributed by attributing to Markoff the basic theorem on least
squares. See, for example, F. N. David and J. Neyman: “An extension of
the Markoff theorem on least squares,” Stat. Research Memoirs, Vol. II
(1938), pp. 105-116. As R. L. Plackett pointed out in his “A historical
note on the method of least squares” (Biometrika, Vol. 36 (1949), pp. 458-460),
the theorem that I ascribed to Markoff was discovered by Gauss and
published in the remarkable memoir just quoted.¹⁶

Early in the present century, the idea of the loss function attracted the
attention of F. Y. Edgeworth who, under the label of “detriment,” discussed
it in a number of his papers, published in Mind and in the Journal of the
Royal Statistical Society. In one of these papers (Jr. Roy. Stat. Soc., Vol.
71, 1908, and 72, 1909) he was in search of the “most advantageous”
estimates, that is, such estimates as would, in large samples, minimize the
average detriment. Edgeworth, anticipating Fisher by thirteen years,
conceived the conviction that the “most advantageous” estimates (in present
day terminology, asymptotically normal estimates with minimum asymptotic
variance) are those obtained by the “genuine inverse method,” or as
we say now, following Fisher, the maximum likelihood estimates.

¹⁶ For more recent developments in this direction, see E. W. Barankin and John
Gurland: “On asymptotically normal, efficient estimators: I,” University of California
Publications in Statistics, Vol. 1, No. 6 (1951), pp. 89-130.

After Edgeworth, the idea of the loss function was lost from sight for
more than two decades, to be revived in a paper by E. S. Pearson and
myself, “The testing of statistical hypotheses in relation to probabilities
a priori” (Camb. Phil. Soc. Proc., Vol. 29 (1933), pp. 492-510).
Unfortunately, at that time we were not aware of the fact that the idea was not
new. Also, being preoccupied with other problems, we just mentioned the
idea as a possible approach to the problem without attempting to do
anything concrete. The real revival of the idea of the loss function and of the
associated risk function began with the entry on the scene of statistical
research of Abraham Wald. Combining the concept of loss with another
concept of minimax (also outlined in the above publication of 1933), Wald
has initiated a new branch of statistical theory and, followed by Wolfowitz
and a host of younger searchers, brought it to a remarkable level of elegance
and generality. The principal results obtained in this direction are
summarized in the recent book of Wald: Statistical Decision Functions (Wiley,
New York, 1950, pp. ix + 179).


Part 3. Fiducial Argument and the Theory of Confidence Intervals 


(This section has been reproduced from Biometrika, Vol. 32 (1941), pp. 128-150, through
the courtesy of the Editor, Professor E. S. Pearson.)


1. INTRODUCTION 


The theory of confidence intervals was started by the present author 
about 1930. At that time it was taught in lectures given both at the Uni- 
versity and at the Central College of Agriculture, Warsaw, Poland. The 
theory found immediate practical applications, and before any theoretical 
paper was published, a booklet (Pytkowski, 1932)¹ appeared giving numerical
confidence intervals for means and for regression coefficients. The term
“confidence interval” is a translation of the original Polish “przedział
ufności.” The author’s theoretical results appeared two years later
(Neyman, 1934). At almost the same time the first tables and graphs of con-
fidence intervals were published (Clopper & Pearson, 1934) in a paper 
which gave a remarkably clear explanation of the difference between the 
new approach to the problem of estimation and the old one, by means of 
Bayes’s theorem. 

The first publication on fiducial argument (Fisher, 1930) anticipated the
booklet of Pytkowski by two years. The present author overlooked this
article for some time. However, when preparing his paper of 1934, he was
already acquainted with it and also with the next paper (Fisher, 1933) on
a similar subject. Although Fisher’s method of approach was entirely
different from the author’s, the numerical identity of Fisher’s fiducial limits
with the confidence limits in the author’s theory, and also some of Fisher’s
early comments, suggested to the author that the two theories are essentially
the same. Accordingly, and owing to the difference in dates of publications,
the author considered his own work as an extension of the previous results
of Fisher. This was clearly stated in the author’s paper of 1934.

¹ The references cited are given in full at the end of this Part, on pages 253-254.

Apart from the above points of agreement the author had found certain 
passages and conceptions in the publications of Fisher which were difficult 
for him to understand and to reconcile with what was essential in the theory 
of confidence intervals. They included “fiducial probability” and “fiducial 
distribution of a parameter.” However, the author was inclined to think 
that these were, more or less, lapsus linguae, difficult to avoid in the early 
stages of a new theory. This attitude was clearly expressed in the paper 
of 1934. That paper was read before a meeting of the Royal Statistical 
Society and was followed by a public discussion recorded in the Society’s 
Journal. Fisher took part in the discussion, and it was a great surprise to 
the author to find that, far from recognizing them as misunderstandings, 
he considered fiducial probability and fiducial distributions as absolutely 
essential parts of his theory. As a result, the author began to doubt whether 
the two theories were, in fact, equivalent. These doubts were only increased 
by Fisher’s insistence that the calculation of fiducial distributions and fidu- 
cial limits must be limited to cases where sufficient statistics exist (Fisher, 
1936), and by his warnings against inconsistencies in the theory of con- 
fidence intervals. 

When questioned on the subject, the author could not conceal his doubts 
and they were published (Neyman, 1938a). Subsequent publications by 
other authors appear to be divided. Some, e.g. the very important papers 
by Wald (1939) and by Wald & Wolfowitz (1939), deal with the theory 
of confidence intervals, entirely ignoring fiducial theory. Others (Starkey, 
1938; Sukhatme, 1938; Yates, 1939), at the other extreme, work on the 
ground of fiducial argument and ignore the confidence intervals. There is 
also an intermediate group of authors with an almost continuous spectrum 
of opinions. Pitman (1939), in a very interesting paper on estimation of 
location and scale parameters, states that the two theories “are essentially 
the same and that their two points of view are both necessary for a full 
comprehension of the theory of estimation.” And a few pages further: “I 
at first called it the fiducial probability function, but finally decided to 
shorten the name by dropping the word ‘probability.’ ” 

Next we find the statement (Bartlett, 1939) that “by a distribution of 
fiducial type we shall mean a distribution providing at least confidence 
intervals in the sense of Neyman.” This statement is used in an argument 
(Bartlett, 1936, 1939) that, as a distribution deduced by Fisher (1936) does 
not seem to provide confidence limits, there must be some error in the
deduction. A similar point of view, but with a stronger leaning towards
confidence intervals, is expressed by Welch (1939). In this paper various 
general claims of Fisher are analyzed, essentially from the point of view 
of confidence intervals, and tested on appropriate examples. Among other 
things it is found that the fears of inconsistencies in the theory of confidence 
intervals are unfounded. 

A quite different school of thought is represented by Jeffreys (1940), 
according to which the fiducial approach to the problem of estimation is 
completely equivalent with that by inverse probability. 

Fisher (1937, 1939a, 1939b) and Yates (1939) emphatically deny that
there is an error in Fisher’s paper of 1936. On the contrary, it is said that 
the results then published were obscured by the controversy arising from 
Bartlett’s confusion about the nature of fiducial argument. Also, especially 
in earlier papers (1930, 1933, 1936), Fisher is equally emphatic on the dis- 
tinction between the fiducial and the inverse probability approaches to the 
problem of estimation. 

The above survey shows that there is an interesting divergence of opinions 
as to what is essential in the fiducial theory in general and as to whether 
it is in any way connected with the theory of confidence intervals. The 
perusal of all the literature quoted does not allow the present author to 
form any precise opinion as to the first of these questions. On the other 
hand, there now seems to be sufficient ground for answering the second, 
concerning the relationship between the two theories. The purpose of the 
present paper is to show that there is none. The relevant points concerning 
this question, which were possible to establish on the ground of earlier 
literature, are explained in excellent papers by Pearson (1939) and Welch 
(1939), with the final conclusion that, in spite of various differences, the 
two theories are closely related. However, fresh evidence provided by 
papers of Fisher (1939a, 1939b) and Yates (1939) shows that no such rela- 
tion exists and that the authors suspecting it were misled by the incomplete- 
ness of earlier writings concerning fiducial argument. 

As a result of the present paper it may be found expedient, for the sake 
of clarity, to avoid confusion of terminologies appropriate to the two 
theories. Instead of writing, as some authors do, on “fiducial or confidence” 
limits, it may be preferable to discuss “fiducial limits” or “confidence limits,” 
as the case may be, separately. 


2. BASIC IDEAS IN THE THEORY OF CONFIDENCE INTERVALS 


The key to understanding the theory of confidence intervals is in being 
clear about what might be called the classical point of view in the theory 
of probability. This theory was originally built up to answer questions
about how frequently a given combination of throws will occur in a long
series of games of dice. Thus, the probability of a certain combination 
found to be, say, 1/5, implies that this combination would appear in about 
20% of a long series of actual games. This agreement may, but need not, 
be observed. In the latter case, we would say that the assumptions under- 
lying the deduction were not realized by the actual experiments. The dice 
used were perhaps “biased,” and so forth. The point is that, whenever it 
is said that a given set of probabilities does refer to some phenomena, then 
it is understood that the relative frequencies of various aspects of the 
phenomena, in a long series of trials, are approximately equal to correspond- 
ing probabilities. This is just what the author calls the classical point of 
view in the theory of probability. It is excellently explained by v. Mises 
(1939), but is more general than the definition of probability adopted by 
that author.²

Apart from the classical point of view on probability, there is another. 
It considers the probabilities as measures of rational belief in the truth of 
a given proposition. Here the agreement between the probability and some 
relative frequency is not essential. 

The theory of confidence intervals was built up to give a solution of
problems of estimation which would have a clear frequency interpretation,
characteristic of the classical point of view. Consider a set E of n observable
random variables, x₁, ..., xₙ, and assume as given that the function
p(E | θ₁, θ₂, ..., θₛ) represents its elementary probability law. Here
θ₁, ..., θₛ represent certain parameters whose values are unknown.

The above should be interpreted as follows. There are some actual trials T
which are able to determine the values of the x’s. There are also some
numbers ϑ₁, ϑ₂, ..., ϑₛ, unknown to us, such that, whatever be a region w in the
space of the x’s, the integral of p(E | ϑ₁, ϑ₂, ..., ϑₛ) taken over this region is
approximately equal to the relative frequency with which the point E, as
determined by the trials T, falls within that region w. The problem of
estimating one of the parameters, e.g. θ₁, consists in using just one system of the
x’s as determined by the trials T to calculate ϑ₁ approximately. Alternatively,
it may consist in calculating an interval (a, a + d) which “presumably” covers
ϑ₁.

² It will be noticed that the classical point of view on probability does not imply any
particular definition of that concept. It is not suggested that the one adopted by
v. Mises is the only one that could be consistently used.

The original approach to this problem is based on Bayes’s theorem.
Denote by p(θ₁, θ₂, ..., θₛ) the elementary probability law of the θ’s. Then

\[ p(\theta_1, \theta_2, \ldots, \theta_s \mid E') = \frac{p(\theta_1, \ldots, \theta_s)\, p(E' \mid \theta_1, \ldots, \theta_s)}{\int \cdots \int p(\theta_1, \ldots, \theta_s)\, p(E' \mid \theta_1, \ldots, \theta_s)\, d\theta_1 \cdots d\theta_s} \tag{1} \]

will be the relative probability law, or the probability law a posteriori, of all the
θ’s given the observed system E′ of the values of the x’s. It can be used to
calculate the most probable value of θ₁. Alternatively, given a number d > 0,
the law can be used to find the interval (a, a + d) such that the a posteriori
probability

\[ P\{a + d > \theta_1 > a \mid E'\} \]

is greatest.
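
A small numerical sketch may help to fix the mechanics of formula (1). It is
an illustration under stated assumptions and not part of the original paper:
a single parameter θ takes one of three hypothetical values with a
hypothetical a priori law, the observations form a binomial count, and the
integral in the denominator of (1) is replaced by a finite sum. The names
`posterior`, `prior`, and `likelihood` are invented for the example.

```python
from math import comb


def posterior(prior, likelihood):
    """Discrete analogue of formula (1): prior times likelihood, normalized.

    prior      -- dict mapping each admissible value of theta to its a priori probability
    likelihood -- function giving p(E' | theta) for the observed system E'
    """
    joint = {theta: weight * likelihood(theta) for theta, weight in prior.items()}
    total = sum(joint.values())   # the sum stands in for the integral in (1)
    return {theta: value / total for theta, value in joint.items()}


# Hypothetical mass-production setting: theta is the proportion of defectives
# in a lot, and lots of the three kinds occur with the frequencies below.
prior = {0.01: 0.7, 0.05: 0.2, 0.20: 0.1}

# Observed system E': k defectives found among n inspected items (illustrative).
n, k = 20, 2


def likelihood(theta):
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)


post = posterior(prior, likelihood)
most_probable = max(post, key=post.get)   # most probable value a posteriori
print(post, most_probable)
```

The a posteriori law obtained this way depends on the a priori law supplied, so
its frequency interpretation stands or falls with the status of that law.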

Our attitude towards this kind of solution, dictated by the classical point 
of view on probability, depends on circumstances and may be twofold. 

The circumstances of the problem may imply not only that the x’s but
also that the θ’s are random variables and that the function p(θ₁, ..., θₛ)
could be used to calculate the relative frequencies of various combinations
of values of the θ’s. Such situations are rare, but they do occasionally
occur, especially in problems of genetics and of mass production. If the
function p(θ₁, ..., θₛ) is implied by the problem considered, then the
probability P{a + d > θ₁ > a | E′} has a clear frequency interpretation, as
follows. Imagine a long sequence, S, of cases where the θ’s vary according
to the above law and the x’s are determined by the particular trials
considered. Pick from this sequence S a subsequence S(E′) of such trials in
which the experiments determined the same system of values of the x’s,
namely, the system E′. Naturally, the value of θ₁ in cases belonging to
S(E′) would vary. But, if the functions p(E | θ₁, ..., θₛ) and p(θ₁, ..., θₛ)
do have the presumed relation to the trials considered, it will be found that
among all the intervals of length d, the interval (a, a + d) will contain
the value of θ₁ more frequently than any other, and that this frequency
will be approximately equal to P{a + d > θ₁ > a | E′}. It follows that, if
the function p(θ₁, ..., θₛ) is implied by the circumstances of the problem
of estimation, the use of the formula (1) is perfectly legitimate from the
point of view of the classical theory of probability.

The situation is quite different when the circumstances of the problem
do not imply the a priori probability law. This is most frequently the case.
Moreover, usually there are serious difficulties in considering the θ’s as
random variables. Jeffreys (1939) advises the use of formula (1) also in
such cases, with a function p(θ₁, ..., θₛ) invented for the purpose. He
claims that the conclusions drawn in this way are valid, provided that the
function used is just the one that he suggests. The present author would
not question this statement on condition that the word “valid,” or any
other such description, is not given any significance beyond that described
above. In other words, there seems to be no reason why we should not
agree to call the above conclusions “valid in the sense of Jeffreys.” On the
other hand, it seems essential to be clear that any probability calculated
from (1), with any function p(θ₁, ..., θₛ) not implied by the actual problem,
need not and, generally, will not have any relation to relative frequencies.
It will not be the probability in the classical sense of the word and,
therefore, persons who would like to deal only with classical probabilities,
having their counterparts in the really observable frequencies, are forced
to look for a solution of the problem of estimation other than by means of
the theorem of Bayes.

This solution (Neyman, 1937, 1938b) may be obtained as follows.
Consider the case where the circumstances imply that the x’s, forming a system
E, are random variables with the probability law p(E | θ₁, θ₂, ..., θₛ),
where θ₁, θ₂, ..., θₛ are unknown. Denote by θ̲(E) and θ̄(E) two functions
of the x’s. Obviously, if E is random then these functions will also be
random variables.

DEFINITION 1. If the functions θ̲(E) and θ̄(E) possess the property that,
whatever be the possible value ϑ₁ of θ₁ and whatever be the values of the unknown
parameters θ₂, θ₃, ..., θₛ, the probability

\[ P\{\underline{\theta}(E) < \vartheta_1 < \bar{\theta}(E) \mid \vartheta_1, \theta_2, \ldots, \theta_s\} = \alpha, \tag{2} \]

then we will say that the functions θ̲(E) and θ̄(E) are the lower and the upper
confidence limits of θ₁, corresponding to the confidence coefficient α. The interval
[θ̲(E), θ̄(E)] is called the confidence interval for θ₁.

In spite of the complete simplicity of the above definition, certain persons 
have difficulties in following it. These difficulties seem to be due to what 
Karl Pearson (1938) used to call routine of thought. In the present case the 
routine was established by a century and a half of continuous work with 
Bayes’s theorem. It may be useful, therefore, to give a few illustrations. 

Assume that s = 2, that θ₁ may have only the five values 1, 2, 3, 4, and 5,
and that, at the same time, θ₂ may vary continuously between zero and 1.
To satisfy Definition 1, the only requirement on the functions θ̲(E) and θ̄(E)
is that

\[ P\{\underline{\theta}(E) < \vartheta < \bar{\theta}(E) \mid \vartheta, \theta_2\} = \alpha \tag{3} \]

for all values of ϑ = 1, 2, 3, 4, and 5, and for θ₂ varying between 0 and 1. The
probabilities (2) and (3) are, therefore, not the probabilities of θ₁ falling within
any limits. On the contrary, they are the probabilities of the functions
θ̲(E) and θ̄(E) falling on both sides of a specified number ϑ. These
probabilities are to be calculated from the given function p(E | θ₁, θ₂) with the
value of θ₁ set equal to the same number ϑ. The result must be totally
independent of the values of θ₂, ..., θₛ and must equal α.

It is known (Neyman, 1935b; Feller, 1938) that in certain cases no such
functions θ̲(E) and θ̄(E) exist. Then there are ways of modifying the
formulation of the problem, for example, requiring that the probability on the left
of (2) be at least equal to α, and so forth. In other cases, there will be an
infinity of pairs of confidence limits all corresponding to the same α. In
this case, the practical statistician is at liberty to choose among them.

Let us now consider the frequency interpretation of the solution of the
problem of estimation by means of confidence intervals. Suppose that some
two functions θ̲(E) < θ̄(E) possess property (2) with some large value of α,
say α = 0.99. Their use in practice would consist of (i) observing the values
E′ of the x’s, (ii) calculating the corresponding values of the confidence limits
θ̲(E′) and θ̄(E′), and (iii) stating that the true value ϑ₁ of θ₁ lies between
θ̲(E′) and θ̄(E′). The justification is simple and perfectly in line with the
classical point of view of probability: in the course of many applications, the
relative frequency of cases in which the statement θ̲(E) < ϑ₁ < θ̄(E) is correct
will be approximately equal to α = 0.99, whether or not the parameters for
estimation are the same in all cases.
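
The frequency interpretation can be checked by simulation. The following
Python sketch is an illustration under assumptions of my own choosing, not a
construction from the paper: the observations are taken to be normal with
known unit variance, the confidence limits are the sample mean plus and minus
λ/√n, and the true value of the parameter is deliberately changed from one
application to the next according to an arbitrary fixed pattern. The name
`coverage_frequency` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_frequency(alpha, applications, n, rng):
    """Relative frequency with which the stated interval covers the true value.

    Assumption for illustration: each application observes n normal variables
    with unit variance and an unknown mean that differs between applications.
    """
    lam = NormalDist().inv_cdf((1 + alpha) / 2)
    half_width = lam / sqrt(n)
    correct = 0
    for i in range(applications):
        true_value = float(i % 7 - 3)          # not random: it simply varies
        sample = [rng.gauss(true_value, 1.0) for _ in range(n)]
        centre = fmean(sample)
        lower, upper = centre - half_width, centre + half_width
        if lower < true_value < upper:         # the statement made was correct
            correct += 1
    return correct / applications


freq = coverage_frequency(alpha=0.99, applications=20000, n=10,
                          rng=random.Random(1))
print(freq)
```

The printed frequency should lie close to 0.99 even though the parameter
estimated is not the same in all applications.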

The word “stating” above is put in italics to emphasize that it is not
suggested that we can “conclude” that θ̲(E′) < ϑ₁ < θ̄(E′), nor that we should
“believe” that ϑ₁ is actually between θ̲(E) and θ̄(E). In the author’s opinion,
the word “conclude” has been wrongly used in that part of statistical
literature dealing with what has been termed “inductive reasoning.” Moreover,
the expression “inductive reasoning” itself seems to involve a contradictory
adjective. The word “reasoning” generally seems to denote the mental
process leading to knowledge. As such, it can only be deductive. Therefore,
the description “inductive” seems to exclude both the “reasoning” and also
its final step, the “conclusion.” If we wish to use the word “inductive” to
describe the results of statistical inquiries, then we should apply it to
“behaviour” and not to “reasoning.” The fact that a given pair of functions
θ̲(E) and θ̄(E) satisfies the identity (2) may be “deduced” from the properties
of the function p(E | θ₁, ..., θₛ). Earlier trials may show characteristics in
the empirical distribution of the x’s which seem in agreement with the function
p(E | θ₁, ..., θₛ). On these grounds, after observing the values of the x’s
in a case where the θ’s are unknown and calculating θ̲(E′) and θ̄(E′), we may
decide to behave as if we actually knew that the true value ϑ₁ of θ₁ were
between θ̲(E′) and θ̄(E′). This is done as a result of our decision and has
nothing to do with “reasoning” or “conclusion.” The reasoning ended when
the functions θ̲(E) and θ̄(E) were calculated. The above process is also devoid
of any “belief” concerning the value ϑ₁ of θ₁. Occasionally we do not behave
in accordance with our beliefs. Such, for example, is the case when we take
out an accident insurance policy while preparing for a vacation trip. In doing
so, we surely act against our firm belief that there will be no accident;
otherwise, we would probably stay at home. This is an example of inductive
behaviour.

Obviously, if there are many different pairs of functions, $\underline{\theta}(E)$ and $\bar{\theta}(E)$, all
corresponding to the same $\alpha$, our choice of the one to use must be based on the
detailed study of their properties. For example, if it appears that the
difference between one pair, $\bar{\theta}_1(E) - \underline{\theta}_1(E)$, is always (or most frequently)
smaller than that between some other pair, then we would probably prefer to use
the first. The problem of determining the confidence limits and of studying their
properties forms the subject of the theory of confidence intervals.


3. NECESSARY AND SUFFICIENT CONDITIONS FOR A PAIR OF FUNCTIONS TO BE 
CONFIDENCE LIMITS 


Let $a(E) \le b(E)$ be any two single-valued functions of the $x$’s determined
for all possible systems of their values. Denote by $W$ the space of the $x$’s
and by $\vartheta_1$ one of the possible values of $\theta_1$. Finally, let $A(\vartheta_1)$ denote the region
in the space $W$ composed of all points $E$ which satisfy the double inequality,


    $$a(E) \le \vartheta_1 \le b(E). \qquad (4)$$


It was proved (Neyman, 1937) that for the two functions, $a(E)$ and $b(E)$,
to be the lower and upper confidence limits for the parameter $\theta_1$, it is
necessary and sufficient that, whatever be the possible value $\vartheta_1$ of $\theta_1$, the
probability


    $$P\{E \in A(\vartheta_1) \mid \theta_1 = \vartheta_1\} \equiv \alpha. \qquad (5)$$


The identity refers to the arbitrary variation of $\theta_2, \ldots, \theta_s$.

This condition will be used below to show that a certain pair of functions
does not represent the confidence limits. For this purpose, the following
steps will be taken: We shall select a convenient value $\vartheta_1$ of the estimated
parameter $\theta_1$ and determine the region $A(\vartheta_1)$ as in (4). Next, we shall
substitute this same value $\vartheta_1$ instead of the parameter $\theta_1$ in the elementary
probability law of the variables considered, getting
$p(E \mid \vartheta_1, \theta_2, \ldots, \theta_s)$. This function will be integrated over $A(\vartheta_1)$ to find
the probability $P\{E \in A(\vartheta_1) \mid \theta_1 = \vartheta_1\}$ as in the left-hand side of (5). But
this integral will be dependent on the values of the other parameters involved,
showing that the identity (5) is not satisfied. The conclusion will be that the
particular functions considered are not confidence limits.


4. DIFFERENCES BETWEEN THE THEORY OF CONFIDENCE INTERVALS AND THE
THEORY OF FIDUCIAL ARGUMENT 


In this section we will consider examples treated both from the point of view
of confidence intervals and of fiducial argument. These will be selected to
illustrate both the conceptual and the numerical differences between the two
theories.

(i) Evidence of conceptual differences between the two theories.—The first
results obtained concerning confidence intervals (Neyman, 1934) refer to the
case where all the $n$ observable variables $x_i$ are mutually independent,
normally distributed, have the same though unknown standard error $\sigma$, and
expectations $\mathcal{E}(x_i)$ which are linearly connected with some $s < n$ unknown
parameters $p_1, p_2, \ldots, p_s$, so that


    $$\mathcal{E}(x_i) = a_{i1}p_1 + a_{i2}p_2 + \cdots + a_{is}p_s. \qquad (6)$$


Here the $a$’s are supposed to be known and to form a non-singular matrix.
Denote by $\theta$ any linear combination of the same $p$’s, that is


    $$\theta = b_1p_1 + b_2p_2 + \cdots + b_sp_s, \qquad (7)$$


with known $b$’s not all equal to zero. In these circumstances, a confidence
interval for $\theta$ is given by


    $$F - St_\alpha \le \theta \le F + St_\alpha, \qquad (8)$$


where $F$ denotes the best unbiased estimate of $\theta$ (David & Neyman, 1938),
$S$ the estimate of the standard error of $F$, and $t_\alpha$ the value of the
“Student”-Fisher $t$ corresponding to the number of degrees of freedom $n - s$ and
to $P = 1 - \alpha$. The application of more recent theory (Neyman, 1935b) shows
that the confidence intervals (8) have distinct advantages over any others by
satisfying the definition (Neyman, 1937) of the “short unbiased system of type
$B_1$.” Without entering into these details, we shall consider the particular
case where $s = 1$, $a_{i1} = 1$ and $b_1 = 1$. This will be the case if all the $x$’s
come from the same unknown normal population and it is desired to estimate
its mean, $\theta = \mathcal{E}(x_i)$. In that case $F = \bar{x}$ and


    $$S^2 = \frac{\Sigma(x_i - \bar{x})^2}{n(n - 1)}. \qquad (9)$$


As mentioned, the general confidence interval (8) was discussed in
lectures about 1930, and in 1932 a publication appeared using the concept and
the formula (8).

As far as is known, the first full discussion of the corresponding result in
the fiducial theory was given by Fisher a few years later (Fisher, 1935,
1936), and here is the relevant passage from the second paper.


If a sample of $n$ observations, $x_1, \ldots, x_n$, has been drawn from a normal population
having a mean value $\mu$, and if from the sample we calculate the two statistics
$\bar{x} = \Sigma x_i/n$ and $s^2 = \Sigma(x_i - \bar{x})^2/(n - 1)$, . . . , “Student” has shown (1925)³ that
the quantity $t$, defined by the equation


    $$t = \frac{(\bar{x} - \mu)\sqrt{n}}{s} \qquad (10)$$


is distributed in different samples in a distribution dependent only from the size of the
sample, $n$. It is possible, therefore, to calculate, for each value of $n$, what value of $t$ will be
exceeded with any assigned frequency, $P$, such as 1% or 5%. These values of $t$ are, in fact,
available in existing tables (Fisher, 1925-34).

³ Actually, of course, this result appeared earlier (“Student,” 1908).

It must now be noticed that $t$ is a continuous function of the unknown parameter, the
mean, together with observable values, $\bar{x}$, $s$ and $n$, only. Consequently the inequality
$t > t_1$ is equivalent to the inequality


    $$\mu < \bar{x} - \frac{st_1}{\sqrt{n}}, \qquad (11)$$


so that this last inequality must be satisfied with the same probability as the first. This
probability is known for all values of $t_1$, and decreases continuously as $t_1$ is increased. Since,
therefore, the right-hand side of the inequality takes, by varying $t_1$, all real values, we may
state the probability that $\mu$ is less than any assigned value, or the probability that it lies
between any assigned values, or, in short, its probability distribution, in the light of the
sample observed.

It is of some importance to distinguish such probability statements about the value of
$\mu$, from those that would be derived by the method of inverse probability, from any
postulated knowledge of the distribution of $\mu$ in the different populations which might have been
sampled. . . . To distinguish it from any of the inverse probability distributions derivable
from the same data it has been termed the fiducial probability distribution, and the
probability statements which it embraces are termed statements of fiducial probability.



In the next section we shall analyze the above passage in detail and show 
exactly where and how it conflicts with the classical theory of probability 
and thus with the theory of confidence intervals. Here we will mention 
only that it is ambiguous. Just this kind of ambiguity, which is also found 
in the earlier papers (Fisher, 1930, 1933), is probably responsible for a 
number of authors, including the present one, thinking that the fiducial 
theory and the theory of confidence intervals are linked. 

In a few years it was found necessary to reinterpret formula (11). This 
was done by Fisher himself (1939b) and, somewhat more clearly but on the 
same lines, by Yates (1939). It will be seen from the following quotation 
from Yates’s paper that the above passage by Fisher certainly does not 
contain everything which is now considered essential in the fiducial theory 
and that the presumption of any link between the latter and the theory of 
confidence intervals is unfounded. Yates’s more relevant sentences are 
italicized by the present author. 

While explaining the meaning of the fiducial distribution of the mean $\mu$
of a normal population, Yates mentions that the fiducial distribution of $\sigma^2$
is given by


    $$\frac{1}{\sigma^2} = \frac{\chi^2}{\Sigma(x_i - \bar{x})^2}, \qquad (12)$$


where $\chi^2$ has its usual distribution with $n - 1$ degrees of freedom.


It can then be shown that, for a value of $\mu$ equal to $\mu_1$ and a given $s$, the value of $\bar{x}$ in
subsequent samples would be as small as that observed in a fraction $\alpha$ of the samples,
provided that the actual distribution of $\sigma^2$ is the same as the fiducial distribution given above.
In this form, however, the statement is open to objection on the ground that in
subsequent samples $\sigma$ may in fact be distributed in any manner, and that $s$ will certainly vary
from sample to sample. To avoid this objection we must frankly recognize that we have here
introduced a new concept into our methods of inductive inference, which cannot be deduced by
the rules of logic from already accepted methods. . . . That is . . . the form of fiducial
statement which is implicit in the $t$ test as ordinarily used by practical experimenters. . . .
It must be recognized as essentially different from the statement that $t$ will exceed $t_1$ in a
fraction $\alpha$ of all experiments. The latter is true for any given fixed $\sigma$ or any set of $\sigma$’s. The
former (i.e., the fiducial statement, J.N.) is true for a given $s$ when $\sigma$ is taken to be fiducially
distributed in the appropriate distribution. . . . The logical difference between the two
approaches (fiducial and inverse probability, J.N.) should, however, be recognized. The
approach by inverse probability enables fiducial statements about $\mu$ to be derived from the
classical theory of probability, without the introduction of any new principle, but only at
the cost of postulating a particular a priori distribution of $\sigma$. In the fiducial approach such
a priori postulation is regarded as inadmissible, but in order to discard it a new principle, that of
utilizing the fiducial distribution of $\sigma$, must be introduced. . . . Once the principle is accepted
it is possible, given $\bar{x}$ and $s$, to make formal and exact statements of the fiducial type about
$\mu$ which are independent of all prior knowledge of $\sigma$. If the principle is not accepted, then
it appears that we must either assume an a priori distribution of $\sigma$, or deny that there is any
possibility of making fiducial statements about $\mu$.


The present author is unable to understand the exact meaning of what is
called “fiducial statements about $\mu$.” However, his conclusion is that their
conceptual nature must be quite different from that dealt with in the theory of
confidence intervals. This conclusion is based on the fact that all the
difficulties described by Yates as inherent in the fiducial theory are non-existent
in the theory of confidence intervals. Applications of the latter require no
new principle “which cannot be deduced by the rules of logic,” no assumption
that this or that unknown parameter follows any specified distribution, and
have no connexion with Bayes’s theorem. To make the situation absolutely
clear, imagine a sequence of normal populations $\pi_1, \pi_2, \ldots, \pi_m, \ldots$, with
their means $\theta_1, \theta_2, \ldots, \theta_m, \ldots$ and their standard deviations
$\sigma_1, \sigma_2, \ldots, \sigma_m, \ldots$. Imagine that out of each population $\pi_m$ we have a random
sample $\Sigma_m$ of $n$ individuals, with its mean $\bar{x}_m$ and an estimate of the
corresponding variance $S_m^2$ as in (9). The theory of confidence intervals
guarantees that the relative frequency with which $\bar{x}_m - t_\alpha S_m$ will fall short of
the corresponding $\theta_m$ and, at the same time, $\bar{x}_m + t_\alpha S_m$ will exceed this same
number $\theta_m$, will be, within an error of sampling, equal to $\alpha$. An incredulous
reader may easily check this by a sampling experiment. In this he will be at
liberty to keep $\theta_m$ and/or $\sigma_m$ constant, or to vary them at his pleasure, without
any restriction. Of course, the distributions of the populations sampled should
be more or less normal and the sampling should be random. It follows from the
above passages of Yates that if the requirements above are satisfied but no new
principles accepted, then we have to deny that there is any possibility of
making fiducial statements about $\theta_m$. If so, then the nature of the latter is
different from those involved in the application of the theory of confidence
intervals.
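
The sampling experiment just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the sample size
$n = 10$, the coefficient $\alpha = 0.95$, the ranges from which the means and
standard deviations are drawn, and the function name `coverage` are all
choices made here and are not taken from the text. The constant 2.262157 is
the tabled two-sided 5% point of “Student’s” $t$ with 9 degrees of freedom.

```python
import math
import random

T_ALPHA = 2.262157  # Student's t, 9 degrees of freedom, P = 1 - alpha = 0.05


def coverage(reps=20000, n=10, t_alpha=T_ALPHA, seed=1):
    """Relative frequency with which mean -/+ t_alpha * S covers theta_m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # theta_m and sigma_m vary arbitrarily from one population to the next
        theta = rng.uniform(-100.0, 100.0)
        sigma = math.exp(rng.uniform(-3.0, 3.0))
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        mean = sum(xs) / n
        # S estimates the standard error of the mean, as in formula (9)
        S = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n * (n - 1)))
        if mean - t_alpha * S <= theta <= mean + t_alpha * S:
            hits += 1
    return hits / reps


print(coverage())  # relative frequency of correct statements
```

Replacing the two lines that draw `theta` and `sigma` by constants, or by any
other rule, should leave the relative frequency near $\alpha$, apart from the
error of sampling.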

The comparison of the above comments by Yates with those of Fisher 
gives a curious impression. Where Yates sees so many difficulties and 
restrictions, Fisher mentions none. Yet this very publication of Yates is 
fully endorsed by Fisher (1939b). 

(ii) Numerical differences between the two theories.—Besides establishing
the existence of conceptual differences, it is essential to show that the two 
theories may give different numerical results. We may conclude from the 
discussion above that the application of confidence intervals requires fewer 
restrictions. But there is a logical possibility that, when both theories are 
applicable, they give the same numerical result. The following example 
shows that this is not the case and that fiducial limits need not satisfy the 
definition of confidence limits. 

The example that we are going to discuss refers to the problem of
estimating the difference, say $\delta$, between the means of two populations of which
it is known only that both are normal. Denote by

    $$\begin{array}{l}
    x_{1,1}, x_{1,2}, \ldots, x_{1,n} \\
    x_{2,1}, x_{2,2}, \ldots, x_{2,n'}
    \end{array} \qquad (13)$$

two random samples to be drawn from these populations and let $n \le n'$.
The confidence limits for $\delta$ have been very elegantly obtained by Bartlett.
He did not publish his results himself but they are briefly mentioned in a
paper by Welch (1938). The tendency towards a greater generality of
presentation resulted in certain complications. The following is a less
general but simplified statement of the results.⁴ Assume that the $x$’s in
(13) are numbered in the order in which they will be given by observation.
Otherwise, randomize the second series. Next calculate $n$ differences


    $$u_i = x_{1,i} - x_{2,i}, \quad (i = 1, 2, \ldots, n). \qquad (14)$$


If $\mathcal{E}(x_{1,i}) = \theta + \delta$ and $\mathcal{E}(x_{2,i}) = \theta$, then $\mathcal{E}(u_i) = \delta$. If the S.D.’s of the two
populations sampled are $\sigma$ and $\sigma'$, then the S.D. of $u_i$ will be $(\sigma^2 + \sigma'^2)^{1/2}$.
The consecutive $u$’s will be normal and independent and the problem of
estimating the difference between the means of two normal populations will
be reduced to that of estimating the mean of one population of the $u$’s. Its
solution is given by the confidence interval


    $$\bar{u} - St_\alpha \le \delta \le \bar{u} + St_\alpha, \qquad (15)$$


where $S$ has an obvious meaning and $t_\alpha$ is to be taken with $n - 1$ degrees of
freedom.


⁴ Apart from these, the same author has obtained certain relevant results referring to
the case where $n = n' = 2$ (Bartlett, 1936).

Again, an experiment consisting in repeated sampling of pairs of normal
populations will show that, whatever be $\delta$, $\theta$, $\sigma$, $\sigma'$, whether constant or
varying in an absolutely arbitrary manner, the relative frequency of cases
in which the statement about $\delta$ in the form of (15) will be true will be
approximately equal to $\alpha$. The above solution of the problem, elegant as
it is, is only a partial one. The results of Bartlett do not tell us whether
the family of systems of confidence intervals found by him exhausts all
the possibilities and whether it is possible to construct intervals which
would be, in one sense or another, shorter than those given by (15). These
are interesting and important problems and we may hope to have them
solved.
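
The repeated sampling of pairs of populations can likewise be tried
numerically. The following is a minimal sketch, assuming $n = 8$, $n' = 12$,
$\alpha = 0.95$ and arbitrary ranges for $\delta$, $\theta$, $\sigma$ and $\sigma'$; these values
and the function name `paired_coverage` are illustrative assumptions. The
constant 2.364624 is the tabled two-sided 5% point of “Student’s” $t$ with
$n - 1 = 7$ degrees of freedom.

```python
import math
import random

T_ALPHA = 2.364624  # Student's t, 7 degrees of freedom, P = 1 - alpha = 0.05


def paired_coverage(reps=20000, n=8, n_prime=12, t_alpha=T_ALPHA, seed=2):
    """Relative frequency with which interval (15) covers delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # all four parameters vary arbitrarily from pair to pair
        theta = rng.uniform(-10.0, 10.0)
        delta = rng.uniform(-5.0, 5.0)
        sigma = math.exp(rng.uniform(-2.0, 2.0))
        sigma_p = math.exp(rng.uniform(-2.0, 2.0))
        x1 = [rng.gauss(theta + delta, sigma) for _ in range(n)]
        x2 = [rng.gauss(theta, sigma_p) for _ in range(n_prime)]
        # differences (14); zip uses only the first n values of the second sample
        u = [a - b for a, b in zip(x1, x2)]
        u_bar = sum(u) / n
        S = math.sqrt(sum((v - u_bar) ** 2 for v in u) / (n * (n - 1)))
        if u_bar - t_alpha * S <= delta <= u_bar + t_alpha * S:
            hits += 1
    return hits / reps


print(paired_coverage())  # relative frequency of correct statements about delta
```

The second sample is generated in its order of observation, so no further
randomization is needed before the differences are formed.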


Remark added in 1951: Since the above lines were first published in Biometrika, it 
became apparent that, in the ideas described, Bartlett was anticipated by V. Romanov- 
sky (Atti del Congresso Internazionale dei Matematici, Vol. 6, 1928, pp. 103-105). Also, 
the problem of an optimum solution within the category outlined was solved by Henry 
Scheffé (Annals of Math. Stat., Vol. 14, 1943, pp. 35-44). Later on, Scheffé’s solution 
was extended to an analogous but somewhat more complicated problem by E. W. 
Barankin (Proc., First Berkeley Symposium on Math. Stat. and Probability, 1945/46, 
pp. 433-449). 


A result in fiducial theory corresponding to, but not equivalent with, 
formula (15) has been published by Fisher (1936): 


Let us suppose that a sample of $n$ observations has yielded a mean, $\bar{x}$, and an estimated
variance of the mean, $s^2$, so that $s^2 = \Sigma(x_i - \bar{x})^2/n(n - 1)$; then we know that if $\mu$ is the
mean of the population

    $$\mu = \bar{x} + st, \qquad (16)$$

where $t$ is distributed in “Student’s” distribution. Similarly, for the mean of a second
population, of which we have $n'$ observations, we may write

    $$\mu' = \bar{x}' + s't', \qquad (17)$$

where $t'$ is distributed in “Student’s” distribution with $n' - 1$ degrees of freedom,
independently of $t$. If now

    $$\mu' - \mu = \delta, \quad \bar{x}' - \bar{x} = d, \qquad (18)$$

we find that

    $$\epsilon = \delta - d = s't' - st, \qquad (19)$$

and since $s'$ and $s$ are known, the quantity represented on the right has a known
distribution, though not one which has been fully tabulated. The equation may be written


    $$\epsilon = \sqrt{s^2 + s'^2}\,(t' \cos R - t \sin R), \qquad (20)$$


where $\tan R = s/s'$, so that $R$ is a known angle. If $t$ and $t'$ be taken as the co-ordinates of a
point on a plane, the frequency of the observations falling within any area of the plane is
calculable. The points for which $\epsilon$ has any given value lie on a straight line, at a distance
from the origin $+\epsilon/(s^2 + s'^2)^{1/2}$, and making an angle $R$ with the axis of $t$. The fiducial
probability that $\epsilon$ exceeds any given value is the frequency in the area above this line. If $n$
and $n'$ are both increased, the distribution of $\epsilon$ tends to be normal and independent of $R$;
when $R$ is 0° or 90° the distribution is of “Student’s” form. In general it involves $n$, $n'$,
and $R$ and for any chosen probability, therefore, requires a table of triple entry.

As the reader will notice, no restrictions are mentioned and it is not
suggested that for the practical application of the results any assumption is
needed concerning the variability of the variances of the populations
sampled. Neither is there any suggestion of any new principle that may be
involved. We will return to this point below.

Following the publication of Fisher just quoted, and on his advice,
Sukhatme published a table (Sukhatme, 1938). The quantity tabled may be
denoted by $f(n, n', R)$ and represents the root of the equation


    $$\int_{-\infty}^{+\infty} \int_{t'_0}^{+\infty} G(t) H(t')\,dt'\,dt = 0.025, \qquad (21)$$

where $G(t)$ and $H(t')$ are “Student’s” distributions with $n - 1$ and $n' - 1$
degrees of freedom respectively, while

    $$t'_0 = \frac{f(n, n', R)}{\cos R} + t \tan R. \qquad (22)$$

It follows from the context that $f(n, n', R)$ so calculated is the value such
that the fiducial probability of its being exceeded by $|\epsilon|/(s^2 + s'^2)^{1/2}$ is equal
to 0.05. In other words, the values $f(n, n', R)$ are the fiducial 5% limits of
$|\epsilon|/(s^2 + s'^2)^{1/2}$. As $\epsilon = \delta - d$, if the presumption that the fiducial limits
necessarily lead to confidence intervals be true then this means that the double
inequality


    $$\bar{x}' - \bar{x} - f(n, n', R)\sqrt{s^2 + s'^2} \le \delta \le \bar{x}' - \bar{x} + f(n, n', R)\sqrt{s^2 + s'^2} \qquad (23)$$


must be the confidence intervals for $\delta = \mu' - \mu$. But it is easy to see that
the functions on the extreme parts of (23) do not satisfy the conditions,
explained in § 3 above, necessary and sufficient for them to be the confidence
limits. Take $\delta = 0$ and denote simply by $A$ the region in the space of the $x$’s
including all the points in which the inequality (23) is satisfied. Take the
probability law of the $x$’s and put $\delta = 0$ in it, that is, $\mu' = \mu$. It will be seen
that the integral $I(A)$ of this probability law taken over $A$ depends on the
ratio $\rho = \sigma/\sigma'$ of the two $\sigma$’s appropriate to the two populations sampled and,
thus, that it does not satisfy the identity (5).

Condition (23) defining the region $A$ does not involve the particular $x$’s but
only the means $\bar{x}$, $\bar{x}'$, and the variances $s^2$ and $s'^2$. Consequently, to calculate
$I(A)$ we may start with the probability law of those four variables


    $$p(\bar{x}, \bar{x}', s, s') = \frac{c\,s^{n-2} s'^{n'-2}}{\sigma^n \sigma'^{n'}}
      \exp\left\{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2} - \frac{n'(\bar{x}' - \mu)^2}{2\sigma'^2}
      - \frac{n(n - 1)s^2}{2\sigma^2} - \frac{n'(n' - 1)s'^2}{2\sigma'^2}\right\}, \qquad (24)$$


where $c$ is a purely numerical constant and does not involve any of the
parameters. This function must be integrated over the region $A$ defined by
(23) or by the equivalent inequality


    $$|\bar{x}' - \bar{x}| \le f(n, n', R)\sqrt{s^2 + s'^2}. \qquad (25)$$


In dealing with it, we have to remember that $R$ is not a constant but is
connected with $s$ and $s'$ by the equation $\tan R = s/s'$. The required integral,
or probability, of $\bar{x}$, $\bar{x}'$, $s$, and $s'$ satisfying (25) will be more easily calculated
if we introduce a new system of variables, $u$, $v$, $R$, and $s_0$. These will be
connected to the old system as follows:


    $$\begin{array}{l}
    \bar{x} = \mu + us_0 \sin R, \\
    \bar{x}' = \mu + vs_0 \cos R, \\
    s = s_0 \sin R, \\
    s' = s_0 \cos R.
    \end{array} \qquad (26)$$

The Jacobian $J$ of the transformation is easily found to be

    $$J = s_0^3 \sin R \cos R. \qquad (27)$$
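
The value of the Jacobian can be checked numerically. The sketch below is an
illustration only: it differentiates the transformation (26) by central
differences at one arbitrarily chosen point and compares the absolute value
of the determinant with $s_0^3 \sin R \cos R$. The function names, the chosen
point and the step `h` are assumptions made here; the sign of the determinant
depends on the order in which the variables are listed, so only absolute
values are compared.

```python
import math


def old_variables(u, v, s0, R, mu=0.0):
    """Transformation (26): (u, v, s0, R) -> (x_bar, x_bar', s, s')."""
    return (mu + u * s0 * math.sin(R),
            mu + v * s0 * math.cos(R),
            s0 * math.sin(R),
            s0 * math.cos(R))


def det(m):
    """Determinant by expansion along the first row (adequate for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def numerical_jacobian(point, h=1e-6):
    """Determinant of the matrix of partial derivatives, by central differences."""
    rows = []
    for k in range(4):
        plus, minus = list(point), list(point)
        plus[k] += h
        minus[k] -= h
        fp, fm = old_variables(*plus), old_variables(*minus)
        # derivatives of the four old variables with respect to new variable k
        rows.append([(a - b) / (2 * h) for a, b in zip(fp, fm)])
    return det(rows)  # transposition does not change the determinant


u, v, s0, R = 0.7, -1.3, 2.0, 0.6
print(abs(numerical_jacobian((u, v, s0, R))), s0 ** 3 * math.sin(R) * math.cos(R))
```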
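The limits of variation of the new variables are as follows:

    $$\begin{array}{l}
    -\infty < u, v < +\infty, \\
    0 < s_0, \\
    0 < R < \tfrac{1}{2}\pi.
    \end{array} \qquad (28)$$

The probability law of the new variables will be

    $$p(u, v, s_0, R) = \frac{c}{\sigma^n \sigma'^{n'}}\, s_0^{n+n'-1} e^{-\frac{1}{2}s_0^2\psi} \sin^{n-1} R \cos^{n'-1} R, \qquad (29)$$

with

    $$\psi = \frac{nu^2 \sin^2 R}{\sigma^2} + \frac{n'v^2 \cos^2 R}{\sigma'^2}
      + \frac{n(n - 1) \sin^2 R}{\sigma^2} + \frac{n'(n' - 1) \cos^2 R}{\sigma'^2}. \qquad (30)$$

Inequality (25) will be equivalent to

    $$|v \cos R - u \sin R| \le f(n, n', R). \qquad (31)$$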


As this does not involve $s_0$ the integration with respect to this variable can be
carried out within the extreme limits of its variation. As a result further
integrations may be performed on the probability law of $u$, $v$, $R$,

    $$p(u, v, R) = \int_0^{\infty} p(u, v, s_0, R)\,ds_0
      = \frac{c \sin^{n-1} R \cos^{n'-1} R}{\sigma^n \sigma'^{n'} \psi^{(n+n')/2}}, \qquad (32)$$

where $c$ is again a numerical constant.
Further integration may be conveniently carried out as follows. Substitute
a new variable $z$ for the variable $v$ so that


    $$v = \frac{z + u \sin R}{\cos R}, \quad \frac{\partial v}{\partial z} = \frac{1}{\cos R}. \qquad (33)$$


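Keep $z$ constant within the limits $|z| \le f(n, n', R)$ prescribed by (31) and
integrate for $u$ from $-\infty$ to $+\infty$. The result is


    $$p(z, R) = \int_{-\infty}^{+\infty} p(u, z, R)\,du
      = \frac{c \sin^{n-2} R \cos^{n'-2} R}{\sigma^{n-1} \sigma'^{n'-1} \sqrt{n\sigma'^2 + n'\sigma^2}}
      \times \left[\frac{nn'z^2}{n\sigma'^2 + n'\sigma^2} + \frac{n(n - 1)}{\sigma^2} \sin^2 R
      + \frac{n'(n' - 1)}{\sigma'^2} \cos^2 R\right]^{-(n+n'-1)/2}. \qquad (34)$$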


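The integration is completed by an easy substitution for $z$


    $$I(A) = c\rho^{n'-1} \int_0^{\pi/2}
      \frac{\sin^{n-2} R \cos^{n'-2} R}{\{n(n - 1) \sin^2 R + n'(n' - 1)\rho^2 \cos^2 R\}^{(n+n'-2)/2}}
      \times \int_{-af}^{+af} \frac{dz}{(1 + z^2)^{(n+n'-1)/2}}\,dR, \qquad (35)$$


with $f = f(n, n', R)$ and


    $$a^2 = \frac{nn'}{(n\sigma'^2 + n'\sigma^2)\left(\frac{n(n - 1)}{\sigma^2} \sin^2 R + \frac{n'(n' - 1)}{\sigma'^2} \cos^2 R\right)}.$$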

By inspecting (35) it is more or less evident that $I(A)$ must depend on
the value of $\rho$. However, to avoid any doubt in this respect, it was thought
useful to calculate $I(A)$ for a few values of $\rho$. This was done by Miss
Elizabeth Scott of the Statistical Laboratory, University of California, and
it is a pleasure to record the author’s indebtedness to her. The calculations
involved supplementing the tables of Sukhatme for a denser set of values
of $R$. The calculated values of $I(A)$ are:

    $n = 12$, $n' = 6$

      $\rho$      $I(A)$
      0.1     0.966
      1.0     0.960
     10.0     0.934


Thus the functions representing the fiducial limits for $\delta$ do not satisfy the
conditions necessary and sufficient for them to be the confidence limits of
the parameter in question. It follows that if pairs of normal populations
forming a long sequence are sampled and the extreme parts of the double
inequality (23) calculated, then the relative frequency of cases where the
prediction of the value of $\delta$ by means of these inequalities will be correct
need not be equal to the expected 0.95. It will depend on the value of $\rho$
and, if this is uncertain, this frequency will be unknown. Subsequent
comments by Fisher (Fisher, 1939a) seem to indicate that the frequency in
question is expected to approach 0.95 only if the ratio $\rho$ is not constant
but follows a certain fiducial distribution. It is noteworthy that no such
restriction is to be found in the original work quoted above. On the other
hand, it is more or less in line with those restrictions formulated by Yates.
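
A reader wishing to examine such frequencies for himself may estimate $I(A)$
by a sampling experiment instead of by the integration of (35). The sketch
below is a rough illustration under several assumptions that are not part of
the text: the limits $f(n, n', R)$ are not taken from Sukhatme’s table but
estimated by simulation on a grid of 31 values of $R$ and interpolated
linearly; the numbers of draws, the seeds and all function names are
arbitrary. The estimates carry an error of sampling of a few units in the
third decimal place, so they can be compared only loosely with values
obtained by numerical integration.

```python
import math
import random


def student_t(rng, df):
    """One draw from Student's distribution with df degrees of freedom."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


def build_f_table(n, n_prime, grid, rng, draws=20000):
    """Simulated stand-in for a table of f(n, n', R): for each R on the grid,
    the value exceeded by |t' cos R - t sin R| with frequency 0.05."""
    pairs = [(student_t(rng, n - 1), student_t(rng, n_prime - 1))
             for _ in range(draws)]
    table = []
    for R in grid:
        c, s = math.cos(R), math.sin(R)
        values = sorted(abs(tp * c - t * s) for t, tp in pairs)
        table.append(values[int(0.95 * draws)])
    return table


def estimate_I(rho, table, grid, n=12, n_prime=6, reps=20000, seed=5):
    """Relative frequency with which inequality (25) holds when delta = 0."""
    rng = random.Random(seed)
    sigma, sigma_p = rho, 1.0  # only the ratio rho = sigma / sigma' matters
    step = grid[1] - grid[0]
    hits = 0
    for _ in range(reps):
        x_bar = rng.gauss(0.0, sigma / math.sqrt(n))
        x_bar_p = rng.gauss(0.0, sigma_p / math.sqrt(n_prime))
        # estimated variances of the two means, s^2 and s'^2
        s2 = sigma ** 2 / n * rng.gammavariate((n - 1) / 2.0, 2.0) / (n - 1)
        s2_p = (sigma_p ** 2 / n_prime
                * rng.gammavariate((n_prime - 1) / 2.0, 2.0) / (n_prime - 1))
        R = math.atan2(math.sqrt(s2), math.sqrt(s2_p))  # tan R = s / s'
        i = min(int(R / step), len(grid) - 2)
        f = table[i] + (R - grid[i]) / step * (table[i + 1] - table[i])
        if abs(x_bar_p - x_bar) <= f * math.sqrt(s2 + s2_p):
            hits += 1
    return hits / reps


GRID = [k * (math.pi / 2) / 30 for k in range(31)]
TABLE = build_f_table(12, 6, GRID, random.Random(4))
for rho in (0.1, 1.0, 10.0):
    print(rho, estimate_I(rho, TABLE, GRID))
```

The same pairs of simulated $t$ and $t'$ are reused for every $R$ on the grid, a
choice made here so that the estimated limits vary smoothly with $R$.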


5. VIEWS OF M. S. BARTLETT AND R. A. FISHER 


The controversy in which the main contributors are Bartlett (Bartlett,
1936, 1939) and Fisher (Fisher, 1937, 1939a, 1939b) seems to be based on a
misunderstanding. Presuming that the fiducial limits are always equal
to confidence limits, Bartlett was puzzled by Fisher’s results concerning $\delta$
just quoted, and suspected an error. The subsequent elaborations by Fisher
and Yates amount to a confirmation that the values of $f(n, n', R)$ as tabled
by Sukhatme do not provide the confidence intervals. But both authors
are emphatic that there is no error in the original deductions, and that
Bartlett misunderstood the problem. It is unthinkable that these four
unanimous papers are mistaken and, therefore, we must accept the conclusion
that the presumption of intrinsic identity between fiducial and confidence
limits is unfounded.

But it must be pointed out that, before the appeal to extra-logical prin- 
ciples was published, there was much to be said in favor of the opinion 
that the solution of Fisher, as quoted above, and the work of Sukhatme 
both involved errors in the algebra of probability laws. It also seems that, 
apart from establishing that the fiducial theory and the theory of confidence
intervals are distinct, it will be of some interest to analyze Fisher’s work
in detail and to point out exactly where and how it diverges from the rules 
of ordinary theory of probability on which the theory of confidence intervals 
is based. 

When a system of observable phenomena is treated mathematically, it 
is essential to be clear on exactly what is assumed as given or as known. 
For example, when trying to calculate the area of land from a certain set 
of measurements, it is essential to be clear as to assumptions made concern- 
ing the shape of the land considered. The available data may be consistent 
with a number of such assumptions, e.g. that the surface considered is a 
plane or that it is spherical with a given radius, etc. Whichever of these 
hypotheses is accepted as given, the applications of the appropriate for- 
mulae will give mutually consistent results. But they would not generally 
be consistent if one part of the calculations were made on one hypothesis 
and another on a contradictory one. The differences may be small, but in 
mathematics there are really no “small” nor “large” inconsistencies. There 
are simply inconsistencies. Needless to say, the choice of exactly what is 
to be accepted as given must be made to attain the greatest conformity with 
empirical facts. But this is a question which need not be discussed here. 

The above general principle also applies to the applications of prob- 
ability. There we must be clear as to exactly what are the phenomena or 
the variables which we agree to consider as random in a given inquiry. 
In practice, of course, the random variable will be the one whose value at 
the moment is uncertain and is being determined “by chance.” If X is 
considered as a random variable, the premises of the mathematical problem 
must include some assumptions as to the relative frequencies with which 
X assumes its possible values. These assumptions may vary in specificity, 
but they must be present in the premises. 

Any number or variable which is not random must be clearly recognized
as such. For some time such non-random numbers were called constants.
This was more or less satisfactory with constant numbers. But Fréchet
(Fréchet, 1937) has noticed that we may also consider variables which are
not random and has invented useful terms to describe them. These are
“nombre certain,” “fonction certaine,” etc. We will translate these terms
by “sure number” and “sure function.” The thousandth digit in the
expansion $\pi = 3.1415 \ldots$ is a sure number, although totally unknown to me.
Denote by $f(n)$ the relative frequency of 0’s among the first $n$ digits of the
same expansion of $\pi$. This will be a sure function. On the other hand, if
$\phi(n)$ denotes the number of errors that may be made when calculating $\pi$
to $n$ places of decimals, then $\phi(n)$ may be considered as a random function
of $n$. Considerations of this kind would imply those of a considerable
sequence $S$ of similar attempts to calculate $\pi$, by the same person or by
different persons of a specified category, in which the values of $\phi(n)$ will
vary, as we shall say, at random. It is with respect to just such a sequence
of determinations of the values of the function $\phi(n)$ that our probability
statements will refer. For example, if we either start or finish our
calculations with the probability equal to 0.25 of $\phi(n)$ being between any two
sure numbers $a$, $b$, then the applicational statement is that about 25% of
the numbers of the sequence $S$ satisfy the inequality $a < \phi(n) < b$.
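
The sure function $f(n)$ lends itself to a small illustration. The sketch below
computes the decimals of $\pi$ by Machin’s formula in integer arithmetic and
counts the 0’s among them. Two points are assumptions made here and not in
the text: the “first $n$ digits” are taken to be the $n$ decimals after the
point, and the names `pi_decimals` and `zero_frequency`, together with the
ten guard digits, are arbitrary. However often the calculation is repeated,
the value obtained is the same, which is what distinguishes $f(n)$ from a
random function such as $\phi(n)$.

```python
def pi_decimals(n, guard=10):
    """The first n decimals of pi (after the point), as a string of digits."""
    scale = 10 ** (n + guard)

    def arctan_inverse(x):
        """scale * arctan(1/x) by the alternating series, in integer arithmetic."""
        term = scale // x
        total = term
        k = 1
        sign = -1
        while term:
            term //= x * x
            k += 2
            total += sign * (term // k)
            sign = -sign
        return total

    # Machin's formula: pi = 16 arctan(1/5) - 4 arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inverse(5) - arctan_inverse(239))
    return str(pi_scaled)[1:n + 1]  # drop the leading 3 and the guard digits


def zero_frequency(n):
    """Relative frequency of 0's among the first n decimals of pi."""
    return pi_decimals(n).count("0") / n


print(pi_decimals(10))     # 1415926535
print(zero_frequency(50))  # 0.04 (two 0's among the first fifty decimals)
```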

It is important to notice that the sequence S may consist of just one 
member; then all the proportions relating to this “sequence” will have to be
either 0 or 1. In other words, if the sequence of “random” determinations 
consists of just one element, this element will have the property of a sure, 
not a random, object, in the usual sense of the word. 

Now let us turn to the passage from Fisher’s paper quoted above, pp. 237-8,
and try to see exactly what is supposed to be random there and what elements
of the problem are treated as sure numbers or sure functions. These details
in the set-up are not stated at the outset, but there is no difficulty in collecting
them from appropriate passages in the paper. We first see that the function
$t$ of (10) is supposed to be “distributed in different samples. . . .” This
means that $t$ is a random variable and that its randomness depends on what
is found in those repeated samples, namely, the values of $\bar{x}$ and $s$. It follows
that the probabilities concerning $\bar{x}$, $s$, and $t$ refer to the sequence $S$ of those
“different” samples. The sequence could not consist of just one sample
because, in such a case, the “distribution” of $t$ would not be anything like
“Student’s” law. The references to a normal population sampled and to
“Student’s” law indicate, on the contrary, that the sequence $S$ of samples is
very large indeed, and that the distributions in it are comparable to those
represented by continuous curves.

Up to this time we have not mentioned the population mean pu which is also 
involved in the expression of t. Obviously, this may be treated mathemati- 
cally either as a random or as a sure number. Both methods of approach 
are at our disposal but, in order to avoid inconsistencies, we must be clear as 
to which one we follow. The indication of Fisher’s choice is found a little 
further on in this article, in the place describing the distinction between the 
fiducial and the inverse probability approach: “It is of some importance to 
distinguish such (fiducial) probability statements about the value of $\mu$, from 
those that would be derived by the method of inverse probability from any 
postulated knowledge of the distribution of $\mu$ in the different populations 
which might have been sampled.” This sentence does not seem to leave any 
ground for doubt. In the fiducial approach we consider but one population 
sampled and no distribution of $\mu$ is postulated. Therefore, $\mu$ is a sure number 
and, if $t$ is distributed according to “Student’s” law, it is a result of the 
appropriate variability of $\bar{x}$ and $s$ alone. 

The symbol $t_1$, which also comes into play, is obviously a sure variable 
capable of any real value between $-\infty$ and $+\infty$. We may select it as we wish 
and then obtain the probability $P(t_1)$ of the random variable $t$ exceeding $t_1$ 
from tables. 

Following the article, we will readily agree with Fisher that the inequality 
(11), namely, $\mu < \bar{x} - s t_1/\sqrt{n}$, is equivalent to $t > t_1$ and that it must be 
satisfied with some probability $P(t_1)$. Now consider the phrase: “Since, 
therefore, the right-hand side of the inequality (i.e. $\bar{x} - s t_1/\sqrt{n}$) takes, by 
varying $t_1$, all real values, we may state the probability that $\mu$ is less than any 
assigned value, or the probability that it lies between any assigned values, or, 
in short, its probability distribution in the light of the sample observed.” From 
the point of view of ordinary logic and of ordinary theory of probability this 
phrase is inconsistent with the original set-up. The first inconsistency is 
involved in the closing words of the quotation, “in the light of the sample 
observed,” suggesting that $\bar{x}$ and $s$ in the 
expression $\bar{x} - s t_1/\sqrt{n}$ are not random but sure numbers, referring to one 
particular observed sample. As a matter of fact this same inconsistency 
appears earlier in the statement that $\bar{x} - s t_1/\sqrt{n}$, by varying $t_1$, will run 
through all real numbers. If, as formerly, $\bar{x}$ and $s$ are random with their 
variation appropriate to the sequence $S$, then, whatever value we choose to 
ascribe to $t_1$, say $t_1 = 2$, the expression $\bar{x} - 2s/\sqrt{n}$ is also random and depends 
on the outcome of sampling. 

Apart from this sudden shift in the meaning ascribed to $\bar{x}$ and $s$, there are 
two more inconsistencies. To see the first of them, let us follow Fisher, 
changing our minds about $\bar{x}$ and $s$ and considering them as sure numbers, 
determined by one particular sample. In this case the inequality 
$\mu < \bar{x} - s t_1/\sqrt{n}$ would contain no random elements at all: the first element, $\mu$, is an 
unknown constant, the mean of a single population sampled, $\bar{x}$ and $s$ are fixed 
by the sample observed, and $t_1$ is the value of the sure variable that we have 
chosen to consider. In these circumstances, the inequality may either be 
true or not true and the probability of its being true will equal unity or zero 
and have nothing to do with the probability or frequency $P(t_1)$ which this 
same inequality satisfies within a sequence $S$ of many “different” samples. 

The last inconsistency refers, of course, to the point of view on $\mu$. As we 
have seen above, it is first considered as a sure number, but the passage just 
quoted speaks of the probability of its lying between any assigned limits 
possible to determine from the values of $P(t_1)$. Assume $n = 4$ and that the 
sample observed gives $\bar{x} = 10$ and $s = 2$. Select $t_1 = 0.765$ and $t_1' = -0.765$ 
so that $P(t_1) = 0.25$ and $P(t_1') = 0.75$. This would result in the supposed 
probability $P'$ of $\mu$ lying between the limits $9.235 < \mu < 10.765$, being equal 
to 1/2. Trying to interpret this result in the light of the classical theory of 
probability, we have to conceive a sequence, say $S'$, of cases in 50% of which 
$\mu$ falls between the above limits. But exactly what could this sequence be? 
Either there is such a sequence and then we must also consider other 
populations “which might have been sampled,” and postulate something about the 
distribution of $\mu$,⁵ or else the “sequence” must be the degenerate one of one 
element only with the probability $P'$ equal to either zero or unity, but never 
to 1/2. 
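
The arithmetic of this example can be checked directly. The following is a minimal sketch, not part of the original argument. It assumes $t = (\bar{x} - \mu)\sqrt{n}/s$ with $n - 1 = 3$ degrees of freedom and uses the closed-form distribution function of “Student’s” law for 3 degrees of freedom; the function name `student_cdf_3df` is hypothetical.

```python
import math

def student_cdf_3df(t):
    """Distribution function of Student's t with 3 degrees of freedom.

    Closed form valid for 3 degrees of freedom only.
    """
    u = t / math.sqrt(3.0)
    return 0.5 + (math.atan(u) + u / (1.0 + u * u)) / math.pi

# Values taken from the example in the text.
n, xbar, s = 4, 10.0, 2.0
t1 = 0.765

half_width = s * t1 / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

p_t1 = 1.0 - student_cdf_3df(t1)         # P(t1): frequency of t > t1
p_t1_prime = 1.0 - student_cdf_3df(-t1)  # P(t1'): frequency of t > -t1

print(f"limits: {lower:.3f} < mu < {upper:.3f}")
print(f"P(t1) = {p_t1:.4f}, P(t1') = {p_t1_prime:.4f}")
print(f"difference = {p_t1_prime - p_t1:.4f}")
```

The difference of the two tail frequencies is the figure that the fiducial reading attaches to the fixed limits; in the classical reading it is a frequency within the sequence $S$ of samples, not a probability about $\mu$.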

These are the points previously mentioned by the author (Neyman, 1934), 
which, from the point of view of classical probability, represent conceptual 
inconsistencies. They are also present in the other passage of Fisher quoted 
on p. 241, but a similar analysis of that passage, supplemented by what has 
subsequently been done by Sukhatme, will reveal errors in algebra of proba- 
bility laws as well. These errors are particularly relevant from the point of 
view of the controversies between Bartlett and Fisher. 

The quantities considered in this passage are all dependent on the population 
means $\mu$ and $\mu'$ and on the statistics $\bar{x}$ and $s$ of one random sample and on $\bar{x}'$ 
and $s'$ of the other. Our analysis will also require the consideration of the 
population variances $\sigma^2$ and $\sigma'^2$. We must start by deciding on the random 
or sure character of all these quantities. Fisher’s remark that the two ratios 

$$t = \frac{\mu - \bar{x}}{s}, \qquad t' = \frac{\mu' - \bar{x}'}{s'} \tag{37}$$

are distributed according to “Student’s” law with appropriate degrees of 
freedom suggests that $\mu$ and $\mu'$ are treated as sure numbers and that $\bar{x}$, $\bar{x}'$, $s$, 
and $s'$ are random. There is no reference whatever to the variances $\sigma^2$ and 
$\sigma'^2$. As nothing is disclosed about what distribution they may possess, by 
analogy with the $\mu$’s it is natural to treat them as sure numbers also. 

In order to interpret every step in calculations more easily, we shall imagine 
two normal populations $\pi_1$ and $\pi_2$ sampled and a sequence $A$ of pairs of 
samples, of $n$ and $n'$ individuals respectively, drawn independently from $\pi_1$ and 
$\pi_2$. These pairs of samples will determine $\bar{x}$, $s$, $\bar{x}'$, and $s'$, generating 
distributions appropriate to normal populations. Substituted into formulae (37) 
they will make $t$ and $t'$ vary to generate the two distributions of “Student.” 

With this in mind, let us examine the passage in which Fisher writes 


$$\varepsilon = \delta - d = s't' - st, \tag{38}$$


and comments: “Since $s'$ and $s$ are known, the quantity represented on the 
right has a known distribution, though not one which has been fully 
tabulated.” We see here the same kind of sudden jump in the point of view on 
quantities considered as is found in the passage analyzed previously. 
Formerly $s'$ and $s$ were not “known” but random. Otherwise, the distributions 
of $t$ and $t'$ would not have been those of “Student” but would have been 
normal about zero and due solely to the variability of $\bar{x}$ and $\bar{x}'$. Now $s'$ 
and $s$ are known sure numbers. Let us allow for this shift in conditions and 
try to visualize the character of the distribution of $\varepsilon$ for fixed $s'$ and $s$. For 
this purpose we have to consider not the whole sequence $A$ of pairs of samples 
mentioned above, but only a subsequence $B$ composed only of those pairs of 
samples in which the estimated variances have the same values $s$ and $s'$ as 
the ones supposed to be “known.” The variability of $\varepsilon$ in the subsequence 
$B$ will be the result of the variability of $\bar{x}$ and $\bar{x}'$ only. It is known that the 
mean of a sample from a normal population is independent of the sample 
variance. Consequently the distributions of $\bar{x}$ and $\bar{x}'$ in $B$ will be normal. 
As the connexion between $\varepsilon$ on one hand and $\bar{x}$ and $\bar{x}'$ on the other is linear 
with constant coefficients, it would follow that the distribution of $\varepsilon$ in $B$ 
would be normal also. Therefore, it is with some surprise that one reads 
Fisher’s suggestion that this distribution has not been fully tabulated. 
Evidently, when writing the sentence quoted, Fisher had something else in mind, 
probably depending on the new extra-logical principle described in subsequent 
publications. However this may be, we have to note the conflict between the 
sentence quoted and the rules of ordinary logic and of the classical theory of 
probability. 


5 This is quite essential. Otherwise there would be an error in Bayes’s theorem. 

The distribution of $\varepsilon$ by itself does not play any further role in Fisher’s 
work. Instead he and, subsequently, Sukhatme consider the ratio that we 
will denote by $z = \varepsilon/\sqrt{s^2 + s'^2}$. Fisher does not write any formula 
representing the supposed distribution of $z$ and we have to look for the details of 
his ideas in Sukhatme’s paper. Complimentary references to this paper in 
subsequent publications by Fisher suggest that it is perfectly in line with his 
own ideas. We quote the relevant sentence in Sukhatme’s paper, only 
altering his notation to bring it into agreement with that of Fisher. 


He (Fisher) considers the distribution of 

$$\frac{\varepsilon}{\sqrt{s^2 + s'^2}} = z = t'\cos R - t\sin R, \tag{39}$$

for given $n$, $n'$, and $R$ in order to obtain the probability that $z$ exceeds any given value. 


It is obvious at once that the probability in question does not refer to either 
of the sequences $A$ or $B$ visualized above. The appropriate sequence $C$ of 
pairs of samples to which this probability refers is a part of the sequence $A$ 
composed of all such pairs of samples in which the variances $s^2$ and $s'^2$, while 
variable, keep the ratio $s/s' = \tan R$ = constant. Mathematically, the 
distribution sought is known as the relative distribution law of $z$ given $R$ 
and is denoted by $p(z \mid R)$. If $p(R)$ and $p(z, R)$ are the absolute probability 
law of $R$ and the absolute joint probability law of $z$ and $R$, respectively, then, 
for every $R$ such that $p(R) > 0$, 

$$p(z \mid R) = \frac{p(z, R)}{p(R)}. \tag{40}$$

The relative probability, given $R$, of $z$ exceeding a fixed number $z_1$, that is, 
$P\{z > z_1 \mid R\}$, will be obtained by integrating (40) for $z$ from $z_1$ to $+\infty$. 
There is an alternative way of obtaining the same probability. This 
consists of first finding the relative joint probability law given $R$ of $t$ and $t'$. 
If this is denoted by $p(t, t' \mid R)$ then 

$$P\{z > z_1 \mid R\} = \iint_{w(z_1)} p(t, t' \mid R)\, dt\, dt', \tag{41}$$

where the region of integration $w(z_1)$ is determined by the inequality 

$$z = t'\cos R - t\sin R > z_1. \tag{42}$$

A familiar formula gives 

$$p(t, t' \mid R) = \frac{p(t, t', R)}{p(R)}. \tag{43}$$

Whichever way, (40) or (43), is preferred, the resulting probability 
$P\{z > z_1 \mid R\}$ will have the same value and will refer to the sequence $C$ described 
above. 

Sukhatme has chosen to apply a quadrature procedure to calculate the 
integral (41) with the integrand equal to the product of two of “Student’s” 
distributions with n — 1 and n’ — 1 degrees of freedom respectively. This is 
just the error in algebra of probability laws mentioned above. The $t$ and $t'$ 
are distributed independently and in accordance with “Student’s” laws only 
in the sequence $A$ where both the means $\bar{x}$ and $\bar{x}'$ and also the variances $s^2$ 
and $s'^2$ are undisturbed in their random and independent variation appropriate 
to samples from normal populations. When calculating the probability 
“for a given $R$,” we do not consider the sequence $A$ but only its part $C$ so 
selected that the ratio $s/s'$ is constant. This selection disturbs the original 
distribution of $s$ and $s'$ and is reflected in the resulting joint distribution of 
$t$ and $t'$. 

In our calculations above (26) we have used the letters $u$ and $v$ for what is 
here denoted by $t$ and $t'$. Consequently, the joint probability law $p(t, t', R)$ 
is obtained from (32) by merely substituting $t$ for $u$ and $t'$ for $v$. The absolute 
probability law of $R$ is easily obtained by integrating (34) with respect to $z$ 
between the limits $-\infty$ and $+\infty$. The result is 

$$p(R) = c\,\rho^{\,n'-1}\,\frac{\sin^{n-2} R\,\cos^{n'-2} R}{\{n(n-1)\sin^2 R + n'(n'-1)\rho^2\cos^2 R\}^{(n+n'-2)/2}}, \tag{44}$$


with $c$ denoting a numerical constant. Substituting (32) and (44) into (43) 
we obtain 

$$p(t, t' \mid R) = \frac{\phi(R, \rho)}{\{n(t^2 + n - 1)\sin^2 R + n'(t'^2 + n' - 1)\rho^2\cos^2 R\}^{(n+n')/2}}, \tag{45}$$

with $\phi(R, \rho)$ denoting a function of $R$, $\rho$, $n$ and $n'$ only. $p(t, t' \mid R)$ is just the 
function to be integrated to obtain the relative probability given $R$ of $t$ and 
$t'$ to verify any inequality such as $t'\cos R - t\sin R > z_1$. As one would 
expect $p(t, t' \mid R)$ appears to depend not only on $R$ but also on the ratio of the 
population variances $\rho^2$. 

It follows that, from the point of view of the ordinary theory of probability, 
the Fisher-Sukhatme solution is wrong. The error consists in their confusing 
the absolute probability law of $t$ and $t'$, obtainable by integrating (32) for $R$, 
with the relative probability law given $R$ of the same variables as given by 
(45). Some such error seems to have been suspected by Bartlett. Repeated 
denials and the reference to the extra-logical principle underlying the fiducial 
theory lead us to believe that from the point of view of that particular theory 
the error is non-existent. While accepting these explanations we may still 
regret that the earlier papers by Fisher and that of Sukhatme do not contain 
any clue as to how they are to be interpreted. 
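
The dependence of the relative probability on the ratio of the population variances can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: it takes $t = (\mu - \bar{x})/s$ with $s$ the estimated standard error of the mean, sets both population means to zero, and approximates the sequence $C$ by keeping those pairs of samples whose ratio $s/s'$ falls within a narrow band about $\tan R$. The function names, the band width, the sample sizes and the value $z_1 = 2$ are all hypothetical choices.

```python
import math
import random

def mean_and_std_error(sample):
    """Return the sample mean and the estimated standard error of that mean."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return mean, math.sqrt(ss / (n * (n - 1)))

def tail_given_ratio(rho, z1=2.0, n=5, n_prime=5, tan_r=1.0,
                     band=0.10, trials=120_000, seed=1):
    """Estimate the frequency of z > z1 among pairs with s/s' close to tan R.

    rho is the assumed ratio sigma / sigma' of the population standard
    deviations. Pairs whose ratio s/s' differs from tan R by more than
    the relative amount `band` are discarded.
    """
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(trials):
        xbar, s = mean_and_std_error([rng.gauss(0.0, rho) for _ in range(n)])
        xbar_p, s_p = mean_and_std_error(
            [rng.gauss(0.0, 1.0) for _ in range(n_prime)])
        if abs(s / (s_p * tan_r) - 1.0) > band:
            continue                      # pair does not belong to C
        kept += 1
        epsilon = xbar - xbar_p           # equals s't' - st when mu = mu' = 0
        z = epsilon / math.hypot(s, s_p)
        hits += z > z1
    return hits / kept, kept

p_equal, kept_equal = tail_given_ratio(rho=1.0)
p_unequal, kept_unequal = tail_given_ratio(rho=5.0)
print(f"rho = 1: frequency of z > 2 is {p_equal:.3f} ({kept_equal} pairs kept)")
print(f"rho = 5: frequency of z > 2 is {p_unequal:.3f} ({kept_unequal} pairs kept)")
```

With the same $n$, $n'$ and $R$ the two frequencies differ markedly, which is what (45) predicts; a product of two independent “Student’s” laws would give one value irrespective of $\rho$.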


6. SUMMARY 


1. The theories of fiducial argument and of confidence intervals differ in 
their basic conceptions. The validity of the former requires, at least in 
some cases, the fulfilment of various restrictions of which the theory of 
confidence intervals is totally free, and/or the acceptance of some new 
principles impossible to deduce by the rules of ordinary logic (Yates, 1939; 
Fisher, 1939b). 

2. The two theories may occasionally give the same numerical results in 
the form of fiducial limits on one side and of confidence limits on the other. 
The problem of estimating the difference of means of two unknown normal 
populations shows, however, that this need not always be the case and that 
fiducial limits need not satisfy the definition of confidence limits. 

3. Bartlett’s criticisms of Fisher’s solution of the problem just mentioned 
seem to be due to his considering the problem from the point of view of 
ordinary theory of probability and ordinary logic. In this light Fisher’s 
solution does contain both conceptual misunderstandings (originally pointed 
out in the author’s paper of 1934) inherent in the very concept of fiducial 
distribution of a parameter, and errors in algebra of probability laws. Since 
the first references to the new principles outside of ordinary logic, which 
supposedly justify the fiducial theory, were published after the publication 
of Bartlett’s criticisms, the latter seem to be perfectly justified and useful. 

4. Owing to a certain flaw in the ideas underlying the fiducial theory 
which is noticeable in passages quoted in § 4, it is impossible to insist on 
any definite attitude towards it, except that of doubt. It may be useful, 
however, to express the following conjectures which seem to be very 
probable. If they are wrong then they will be put right and, as a result, the 
situation will be clarified. 

The present author is inclined to think that the literature on the theory 
of fiducial argument was born out of ideas similar to those underlying the 
theory of confidence intervals. These ideas, however, seem to have been 
too vague to crystallize into a mathematical theory. Instead they resulted 
in misconceptions of “fiducial probability” and “fiducial distribution of a 
parameter” which seem to involve intrinsic inconsistencies as described in 
§ 5. In this light, the theory of fiducial inference is simply non-existent in 
the same sense as, for example, a theory of numbers defined by mutually 
contradictory definitions. 

In earlier stages when the problems treated were very simple, the fallacy 
involved in “fiducial probability” was not apparent. Later on, however, 
difficulties appeared and the new principle “which cannot be deduced by 
logic” seems to have been invented to disentangle them in one particular 
case. But the word “principle” implies some generality, hence the drift in 
comments on the same subjects treated in 1936 and again in 1939. From 
the point of view of the direction of this drift it is perhaps significant that 
Yates speaks of “fiducial statements” possible to make on the ground of 
probabilities a posteriori and that the paper by Jeffreys which professes 
the equivalence of fiducial theory with that of inverse probability appeared 
in the Annals of Eugenics, edited by R. A. Fisher. 

However this may be, the only thing that the present author ventures to 
profess is that the theory of fiducial probability is distinct from that of 
confidence intervals. 


Part 4. Stein’s Sequential Procedure 


(Based on a lecture at the Department of Statistics, University College, London, deliv- 
ered in March, 1950.) 


When Professor Egon S. Pearson invited me to speak to you, he suggested 
that I describe some of the more outstanding results obtained in 
the United States during the last decade, which, because of war conditions, 
may not have received the attention that they deserve. As far as I can see, 
the most interesting result of this description is due to Charles M. Stein. 
With Professor Pearson’s and your permission, the subject of my today’s 
talk will be a brief account of Stein’s Sequential Procedure in estimating 
the mean of a normal distribution. Stein’s paper¹ was published in 1945. 
However, in order to appreciate fully his result and, also, in order to give 
due credit to another friend of mine, Dr. Joseph Berkson, I shall begin my 
story a little earlier. 

As you know, one of the earliest results in the theory of confidence 
intervals is the short unbiased confidence interval for the mean $\xi$ of a 
normal distribution with an unknown variance $\sigma^2$. If $\bar{x}$ and $s^2$ stand for 
the sample mean and for the estimate of variance of this mean, based on 
$f$ degrees of freedom, then the confidence interval of the unknown mean $\xi$ 
is given by 

$$\bar{x} - st \le \xi \le \bar{x} + st, \tag{1}$$

where $t$ is taken from Fisher’s tables in accordance with the selected 
confidence coefficient $\alpha$ and the number of degrees of freedom $f$. I have been 
describing this result in my lectures since about 1930, and it was first used 
by W. Pytkowski² in his booklet published in 1932. The theoretical 
background is given in my J.R.S.S. paper of 1934. Finally, the corresponding 
result based on fiducial argument was published by R. A. Fisher in 1935. 
Furthermore, in my paper of 1937 published in the Phil. Trans. Roy. Soc., 
London, I have shown that the confidence interval (1) has the remarkable 
properties of “unbiasedness” and “shortness.” “Unbiasedness” means the 
property that, while the true value of $\xi$ is covered by (1) with the 
preassigned frequency $\alpha$, any other value is covered by (1) less frequently. 
“Shortness” means that, given a false value $\xi'$ of $\xi$, the confidence 
interval (1) covers $\xi'$ less frequently than (or at most as frequently as) any 
other unbiased confidence interval corresponding to the same confidence 
coefficient $\alpha$. 


1 Charles M. Stein: “Two-sample test of a linear hypothesis whose power is 
independent of the variance.” Annals of Math. Stat., Vol. 16 (1945), pp. 243-258. 


Analytically, these properties are expressed as follows. The property 
serving as the definition of a confidence interval $\{\xi_1(E), \xi_2(E)\}$ is 

$$P\{\xi_1(E) \le \xi \le \xi_2(E) \mid \xi, \sigma\} = \alpha. \tag{2}$$

Here, as usual, the letter $E$ stands for the random “event” point, i.e. for the 
set of all the observable random variables. The property of unbiasedness 
is expressed by the relation 

$$P\{\xi_1(E) \le \xi' \le \xi_2(E) \mid \xi, \sigma\} \le P\{\xi_1(E) \le \xi \le \xi_2(E) \mid \xi, \sigma\} \tag{3}$$

valid for all values of $\xi$, $\xi'$ and $\sigma$. Finally, the property of shortness, 
applicable to (1), is written as 

$$P\{\bar{x} - st \le \xi' \le \bar{x} + st \mid \xi, \sigma\} \le P\{\xi_1(E) \le \xi' \le \xi_2(E) \mid \xi, \sigma\} \tag{4}$$

for all confidence intervals $\{\xi_1(E), \xi_2(E)\}$ satisfying (2) and (3), and for 
all $\xi$, $\xi'$ and $\sigma$. 

I must admit that, having obtained this result, I thought that I had 
found a grand thing, not only interesting theoretically, but also important 
practically, and felt naively proud. Unfortunately, this inordinate pride 
was soon punctured by a letter from Dr. Berkson, expressed in polite terms 
but making it quite clear that the practical importance of the confidence 
interval (1) is rather limited. The humiliating part of the story is that 
Joseph Berkson is an M.D. and a practical, rather than a theoretical 
statistician, while I am supposed to be working in theory. Yet a delicate point 
regarding the confidence interval (1) was noticed by Berkson and 
overlooked by me. In this connection it seems appropriate to paraphrase the 
celebrated description of Chevalier de Méré due to Pascal which, at tea 
time, you see on the wall of the Common Room:³ Il n’est pas géomètre, 
mais il a très bon esprit et ça, comme vous savez, est un grand avantage. 


2 See references in part 3 of this Chapter. 

The practical defect of the confidence interval (1) noticed by Berkson is 
that its length, viz. 2st, is a random variable and, what is more, a variable 
capable of assuming arbitrarily large values. In fact, by looking up Elderton’s 
tables relating to the distribution of $\chi^2$, it is easy to compute the 
probability that the length of confidence interval (1) will exceed any pre- 
assigned limit. In order to appreciate the practical importance of this fact, 
imagine an M.D., engaged in some sort of routine analysis, applying interval 
(1) to estimate, say, the average sugar content in a patient’s blood. This 
estimate is needed in order to adjust appropriately the dose of an injection. 
If we grant all the approximations involved, it is obvious that frequently 
the particular determinations used by the M.D. will be concordant and the 
value of s will be small. In these cases, assertions (1) regarding the true 
value of $\xi$ will be usable. However, in other cases the value of $s$ will be 
large and then the assertion regarding $\xi$ will be so vague, say from zero to 
100 per cent, as to be meaningless. 
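
The probability that the length $2st$ exceeds a preassigned limit can also be estimated by simulation instead of tables. The following is a minimal sketch under the assumptions of the text (normal determinations, $s^2$ an estimate of the variance of the mean on $f$ degrees of freedom); the function name, the values $n = 4$, $f = 3$, $t = 3.182$ and the two limits are hypothetical choices made for illustration.

```python
import math
import random

def prob_length_exceeds(limit, sigma, n, f, t, trials=100_000, seed=2):
    """Monte Carlo estimate of the probability that 2*s*t exceeds `limit`.

    Uses the fact that f * s**2 / (sigma**2 / n) follows the chi-square
    law with f degrees of freedom.
    """
    rng = random.Random(seed)
    true_se = sigma / math.sqrt(n)        # true standard error of the mean
    count = 0
    for _ in range(trials):
        chi2 = rng.gammavariate(f / 2.0, 2.0)   # chi-square, f d.f.
        s = true_se * math.sqrt(chi2 / f)
        count += 2.0 * s * t > limit
    return count / trials

# n = 4 determinations, f = 3, and t = 3.182 (alpha = 0.95 for 3 d.f.).
p_over_2 = prob_length_exceeds(limit=2.0, sigma=1.0, n=4, f=3, t=3.182)
p_over_6 = prob_length_exceeds(limit=6.0, sigma=1.0, n=4, f=3, t=3.182)
print(f"P(length > 2) is about {p_over_2:.3f}")
print(f"P(length > 6) is about {p_over_6:.3f}")
```

However large the limit, the estimated probability stays positive, which is the practical defect described above.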

Cases of this kind are, of course, familiar and you must have come across 
a substantial literature dealing with so-called “gross errors.” Gross errors 
must occur from time to time. However, the situation I have in mind is 
not concerned with gross errors but only with such variation of the estimate 
s as is implied by the postulated normal distribution of the particular deter- 
minations. 

Faced with the abnormal length of the confidence interval for the mean 
sugar content $\xi$, the M.D. can do only one thing: not use this confidence 
interval. This may be followed by taking another sample of blood and 
making a new series of determinations, or by computing a new confidence 
interval based on some of the original determinations after rejecting 
suspected “gross errors.” But these further steps concern us less than the 
predominant fact that a universal application of confidence interval (1) is 
impractical. Its use is limited to cases where the value of the estimated 
standard error $s$ does not exceed a certain (no doubt, only vaguely 
determined) limit $\tau$. This, however, implies that the confidence interval (1) is 
not used at all. 


3 Since the time of Karl Pearson, the decorations of the Common Room (where a 
friendly visitor may get tea at 3.45 P.M., irrespective of whether he—or she—is 
mathematically minded or not) include the following quotation from Pascal, written in 
beautiful Gothic: 

“Il a très bon esprit; mais il n’est pas géomètre; c’est, 
comme vous savez, un grand défaut.” 

This is a point of some delicacy and it is worthwhile to emphasize it a 
little. As I have already mentioned, the term “confidence interval 
corresponding to the confidence coefficient $\alpha$” is used to describe the interval 
between two functions of the observable random variables $\xi_1(E)$ and $\xi_2(E)$ 
having the property of bracketing the true value of the estimated 
parameter $\xi$ with the preassigned probability $\alpha$. If we equate 

$$\xi_1(E) = \bar{x} - st, \qquad \xi_2(E) = \bar{x} + st$$

and use these two functions consistently to estimate $\xi$, irrespective of the observed 
values of $\bar{x}$ and $s$, then the long run relative frequency of successful estimates 
will actually be $\alpha$. However, if we restrict the use of these formulae to cases 
where $s \le \tau$, then, strictly speaking, our estimating interval will not be 
bounded by functions $\xi_1(E)$ and $\xi_2(E)$ defined above but by two new 
functions, say $\xi_1^*(E)$ and $\xi_2^*(E)$, defined as follows: 

$$\xi_i^*(E) = \xi_i(E) \ \text{ whenever } s \le \tau, \qquad \xi_i^*(E) \ \text{ not defined otherwise,}$$

for $i = 1, 2$. For convenience of reference, the interval $(\xi_1^*, \xi_2^*)$ will be 
described as the curtailed confidence interval for $\xi$. 

Unexpected as it may seem, the two functions $\xi_1^*(E)$ and $\xi_2^*(E)$ do not 
possess the properties of confidence limits, because the probability that they 
will bracket the true value of $\xi$ is less than $\alpha$ and depends on the value of $\sigma$. 
Let us compute this probability, say $P$. This is the conditional probability, 
given $s \le \tau$, that 

$$\bar{x} - st \le \xi \le \bar{x} + st,$$

where $\xi$ represents the true value of the expectation of $\bar{x}$. We have 


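$$P = \frac{P\{(s \le \tau)(\bar{x} - st \le \xi \le \bar{x} + st) \mid \xi, \sigma\}}{P\{s \le \tau \mid \sigma\}}. \tag{5}$$

In order to evaluate the denominator, we need the probability density function 
of $s$, say, 

$$\frac{c}{\sigma^f}\, s^{f-1} e^{-nfs^2/2\sigma^2}, \tag{6}$$

where $c$ is a numerical factor, independent of $\sigma$. 

In order to compute the numerator, in addition to (6) we need the 
probability density function of $\bar{x}$, 

$$\frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\, e^{-n(\bar{x} - \xi)^2/2\sigma^2}. \tag{7}$$

Owing to the independence of $\bar{x}$ and $s$, the joint probability density function 
of $\bar{x}$ and $s$ is simply the product of (6) and (7). The numerator in (5) is 

$$P\{(s \le \tau)(|\bar{x} - \xi| \le st) \mid \xi, \sigma\} = \int_0^{\tau} \frac{c}{\sigma^f}\, s^{f-1} e^{-nfs^2/2\sigma^2} \left[ \int_{\xi - st}^{\xi + st} \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\, e^{-n(\bar{x} - \xi)^2/2\sigma^2}\, d\bar{x} \right] ds.$$

Similarly, 

$$P\{s \le \tau \mid \sigma\} = \frac{c}{\sigma^f} \int_0^{\tau} s^{f-1} e^{-nfs^2/2\sigma^2}\, ds.$$

The integrals simplify if we substitute 

$$u = \frac{\sqrt{n}\,(\bar{x} - \xi)}{\sigma},$$

then let 

$$G(x) = \frac{2}{\sqrt{2\pi}} \int_0^{x} e^{-y^2/2}\, dy \tag{8}$$

and, finally, put 

$$v = \frac{\sqrt{n}\, s}{\sigma}.$$

Then 

$$P\{(s \le \tau)(|\bar{x} - \xi| \le st) \mid \xi, \sigma\} = \frac{c}{n^{f/2}} \int_0^{\tau\sqrt{n}/\sigma} v^{f-1} e^{-fv^2/2}\, G(vt)\, dv,$$

$$P\{s \le \tau \mid \sigma\} = \frac{c}{n^{f/2}} \int_0^{\tau\sqrt{n}/\sigma} v^{f-1} e^{-fv^2/2}\, dv$$

and 

$$P = \frac{\displaystyle\int_0^{\tau\sqrt{n}/\sigma} v^{f-1} e^{-fv^2/2}\, G(vt)\, dv}{\displaystyle\int_0^{\tau\sqrt{n}/\sigma} v^{f-1} e^{-fv^2/2}\, dv}.$$

It is seen that $P$ is a weighted average of quantities $G(vt)$ 
where, for each $v$, the weight is represented by 

$$\frac{v^{f-1} e^{-fv^2/2}}{\displaystyle\int_0^{\tau\sqrt{n}/\sigma} v^{f-1} e^{-fv^2/2}\, dv}.$$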

Since $G(x)$ is a monotone function of $x$, increasing from zero to unity as $x$ 
grows from zero to infinity, it is obvious that $P$ depends on $\sigma$ and, namely, 
that as $\sigma$ is increased from zero to infinity, the value of $P$ decreases from $\alpha$ 
to zero. Thus, if $\tau$ is fixed in one way or another, and we make assertions 
about $\xi$ in the form 

$$\bar{x} - st \le \xi \le \bar{x} + st \tag{9}$$

only if $s \le \tau$, then the probability of this assertion being correct is always less 
than the chosen confidence coefficient $\alpha$ and, if $\sigma$ happens to be large, is close 
to zero. It follows that interval (9) used only when $s \le \tau$, is not a confidence 
interval. 
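
The behaviour of $P$ as a function of $\sigma$ can be seen numerically. The sketch below evaluates the ratio of the two integrals in $v$ by a simple midpoint rule, taking the weight as $v^{f-1} e^{-fv^2/2}$ and $G(x) = \operatorname{erf}(x/\sqrt{2})$. The function name, the cap on the upper limit of integration, and the values $n = 4$, $f = 3$, $t = 3.182$, $\tau = 1$ are assumptions made for the illustration.

```python
import math

def curtailed_coverage(sigma, tau, n, f, t, steps=20_000):
    """Probability that the curtailed interval covers the true mean.

    Midpoint-rule evaluation of the weighted average of G(v t) with
    weight v**(f-1) * exp(-f v**2 / 2) over 0 <= v <= tau*sqrt(n)/sigma.
    """
    # Beyond v = 12 the weight is negligible, so the range is capped there.
    upper = min(tau * math.sqrt(n) / sigma, 12.0)
    h = upper / steps
    num = den = 0.0
    for k in range(steps):
        v = (k + 0.5) * h
        weight = v ** (f - 1) * math.exp(-f * v * v / 2.0)
        num += weight * math.erf(v * t / math.sqrt(2.0))   # G(v t)
        den += weight
    return num / den

# f = 3 degrees of freedom; t = 3.182 corresponds to alpha = 0.95.
p_small = curtailed_coverage(sigma=0.01, tau=1.0, n=4, f=3, t=3.182)
p_mid = curtailed_coverage(sigma=2.0, tau=1.0, n=4, f=3, t=3.182)
p_large = curtailed_coverage(sigma=100.0, tau=1.0, n=4, f=3, t=3.182)
print(f"sigma = 0.01: P = {p_small:.4f}")
print(f"sigma = 2:    P = {p_mid:.4f}")
print(f"sigma = 100:  P = {p_large:.4f}")
```

For very small $\sigma$ the restriction $s \le \tau$ is practically never active and $P$ is close to $\alpha$; as $\sigma$ grows, $P$ falls towards zero.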

As you know, the properties of confidence interval (1) are connected with 
Student’s distribution. This has an extensive use in testing Student’s 
hypothesis which ascribes a specific value $\xi_0$ to the mean $\xi$ of the normal 
distribution but fails to specify the value of the standard error $\sigma$. Student’s 
test consists of the rule to reject the hypothesis tested when the criterion 

$$\frac{|\bar{x} - \xi_0|}{s}$$

exceeds a specified value $t$. 

This test was proved⁴ to be unbiased of type $B_1$, which means that it is 
the most powerful test of all tests which are unbiased. Yet, you must be 
aware of the fact that it has an unpleasant property. This property is that 
the power function of this test depends on the unknown value of $\sigma$. In fact, 
the argument of the power function is 

$$\rho = \frac{|\xi - \xi_0|\sqrt{n}}{\sigma},$$

where $\xi$ stands for the true value of the mean. As $\rho$ is increased, the power 
function tends to unity and there are some tables from which its values can 
be read. One of the uses for which these tables are intended is to estimate how 
large $n$ should be in order to have a reasonable chance of detecting the 
falsehood of the hypothesis when the true mean $\xi$ differs from the hypothetical 
value $\xi_0$ by a stated amount. Upon inspecting the expression for the 
argument of the power function you will see that, when nothing is known about $\sigma$, 
it is impossible to answer this question. In fact, however large be $n$, if $\sigma$ 
is sufficiently large, then $\rho$ will be as small as desired and the value of the power 
function close to the chosen level of significance. 
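
The dependence of the power on $\sigma$ is easy to exhibit by simulation. The following sketch is an illustration only: it draws $\bar{x}$ and $s$ directly from their known independent distributions under normality, and the function name, the sample size $n = 4$, the critical value $t = 3.182$ (level of significance 0.05 on 3 degrees of freedom) and the values of $\sigma$ are hypothetical choices.

```python
import math
import random

def student_test_power(xi, xi0, sigma, n, t_crit, trials=100_000, seed=3):
    """Monte Carlo estimate of the power of the two-sided Student's test.

    xi is the true mean, xi0 the hypothetical one. The sample mean and
    the estimated standard error s are simulated independently.
    """
    rng = random.Random(seed)
    f = n - 1
    true_se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        xbar = rng.gauss(xi, true_se)
        s = true_se * math.sqrt(rng.gammavariate(f / 2.0, 2.0) / f)
        rejections += abs(xbar - xi0) > t_crit * s
    return rejections / trials

# Same n and the same difference xi - xi0 = 1, three values of sigma.
power_small_sigma = student_test_power(1.0, 0.0, sigma=0.1, n=4, t_crit=3.182)
power_unit_sigma = student_test_power(1.0, 0.0, sigma=1.0, n=4, t_crit=3.182)
power_large_sigma = student_test_power(1.0, 0.0, sigma=100.0, n=4, t_crit=3.182)
print(power_small_sigma, power_unit_sigma, power_large_sigma)
```

With $n$ and the difference of means held fixed, the estimated power runs from nearly unity down to about the level of significance as $\sigma$ is increased.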


4 J. Neyman: “Sur la vérification des hypothèses statistiques composées.” Bull. Soc. 
Math. de France, t. 63 (1935), pp. 246-266. 

When we have some knowledge of $\sigma$, for example if we know that $\sigma$ cannot 
exceed a specified limit, then the tables of the power function of Student’s 
test can be used to estimate the upper bound of $n$ needed to insure that the 
power function does not fall below a desired level. Similarly, if we know 
the upper bound of $\sigma$, we can select sufficiently large values of $\tau\sqrt{n}$ so that 
the probability $P$ of success in estimating $\xi$ using the curtailed confidence 
interval be at least equal to a specified value below $\alpha$. In many cases the value of 
$\sigma$ is not entirely unknown and then both the power function and the curtailed 
confidence interval for $\xi$ are usable. Otherwise, we face a very unpleasant 
difficulty. 

As you know, the properties of a test are determined by the corresponding 
critical region. Similarly, the properties of a confidence interval are those 
of the corresponding regions of acceptance. 

With this in mind, a question occurred to me as to whether or not it is 
possible to find critical regions for testing Student’s hypothesis and regions 
of acceptance for estimating $\xi$ having more satisfactory properties. From the 
critical region, $w$, we would require that it correspond to a preassigned level 
of significance, 

$$P\{E \in w \mid \xi_0, \sigma\} = \varepsilon,$$

and that for sufficiently large values of $|\xi_0 - \xi|$, the power function 
$\beta(\xi, \sigma \mid w)$ of the region $w$ have sufficiently large values irrespective of the value 
of $\sigma$. Of course, it would be most satisfactory if $\beta(\xi, \sigma \mid w)$ were independent 
of $\sigma$ and could tend to unity as $|\xi_0 - \xi|$ is increased. As regards the regions 
of acceptance, we would require that they correspond to the preassigned 
confidence coefficient $\alpha$ and that the length of corresponding confidence intervals 
never exceed a fixed finite number $M$. 

It is easy to see that there is a connection between the two questions. In 
fact, a negative answer to the question regarding the critical region implies 
a negative answer to the question regarding the regions of acceptance. To see 
this, assume for a moment that a system $A$ of regions of acceptance $A(\xi)$ is 
found, corresponding to the confidence coefficient $\alpha = 1 - \varepsilon$, such that the 
length of the corresponding confidence intervals does not exceed $M$. We shall 
see that this assumption implies the existence of a critical region $w$ 
corresponding to the level of significance $\varepsilon$ and such that, whenever $|\xi_0 - \xi| > M$, 
the power function 

$$\beta(\xi, \sigma \mid w) \ge 1 - \varepsilon$$

irrespective of the value of $\sigma$. 

In order to prove this proposition, notice that, if $|\xi_0 - \xi| > M$, then no 
confidence interval can cover both $\xi_0$ and $\xi$. This means that the region of 
acceptance $A(\xi_0)$ and the region of acceptance $A(\xi)$ have no points in 
common. In other words, $A(\xi)$ lies entirely within the region $w = W - A(\xi_0)$. 

Now select the region $w = W - A(\xi_0)$ as the critical region for testing 
Student’s hypothesis that ascribes to $\xi$ the value $\xi_0$. Using the basic 
property of the region of acceptance, we have 

$$P\{E \in A(\xi_0) \mid \xi_0, \sigma\} = \alpha = 1 - \varepsilon.$$

Hence 

$$P\{E \in w \mid \xi_0, \sigma\} = 1 - \alpha = \varepsilon.$$

Thus, the region $w$ corresponds to the preassigned level of significance. 
Assume now that the hypothesis tested is false and that the true value $\xi$ 
differs from $\xi_0$ by more than $M$. The value of the power function 
corresponding to this value is 

$$\beta(\xi, \sigma \mid w) = P\{E \in w \mid \xi, \sigma\}.$$

But, as we have noticed above, the region $w$ includes $A(\xi)$. Hence 

$$\beta(\xi, \sigma \mid w) \ge P\{E \in A(\xi) \mid \xi, \sigma\} = 1 - \varepsilon,$$

irrespective of the value of $\sigma$. Q.E.D. 

The questions just described were attacked by Dr. George B. Dantzig,
then a colleague of mine, and were answered in the negative. Studying the
structure of regions similar to the sample space with regard to $\sigma$, while
$\xi = \xi_0$ is kept fixed, Dantzig found that, if the power function of such a
region is independent of $\sigma$, then the region is similar to $W$ not only
with respect to $\sigma$, but also with respect to $\xi$ and, therefore,

$$\beta(\xi, \sigma \mid w) = \varepsilon$$

identically in $\xi$ and $\sigma$. This result appeared in print.⁵ Furthermore,
Dantzig proved a more general proposition: Whatever be the region $w$, similar
to the sample space with respect to $\sigma$, when $\xi = \xi_0$, and whatever
be $\xi \ne \xi_0$, the upper limit of its power function
$\beta(\xi, \sigma \mid w)$ as $\sigma \to \infty$ cannot exceed $\varepsilon$.
The proof of this proposition is very simple. It is known⁶ that the asymmetric
Student’s test has the property of being the uniformly most powerful test of
Student’s hypothesis tested against the set of admissible hypotheses ascribing
to the mean $\xi$ values on one side of the hypothetical value $\xi_0$. Assume,
for example, that $\xi_1 > \xi_0$. Then the most powerful test of the hypothesis
that $\xi = \xi_0$ tested against the alternative $\xi = \xi_1$ has its critical
region, say $w_0$, defined by the inequality,

$$\bar{x} > \xi_0 + s\,t(\varepsilon),$$


5 George B. Dantzig: “On the non-existence of tests of Student’s hypothesis having
power functions independent of σ.” Annals of Math. Stat., Vol. 11 (1940), pp. 186-192.
6 J. Neyman and E. S. Pearson: “On the problem of the most efficient tests of
statistical hypotheses.” Phil. Trans. Roy. Soc., London, Ser. A, Vol. 231 (1933), pp. 289-337.
where $t(\varepsilon)$ is a suitable constant. It follows that

$$\beta(\xi_1, \sigma \mid w_0) \ge \beta(\xi_1, \sigma \mid w).$$

However, it is known that

$$\lim_{\sigma \to \infty} \beta(\xi_1, \sigma \mid w_0) = \varepsilon,$$

and it follows that

$$\limsup_{\sigma \to \infty} \beta(\xi_1, \sigma \mid w) \le \varepsilon.$$

Thus, within the n-dimensioned space $W$ there are no “satisfactory” critical
regions for testing Student’s hypothesis and, consequently, there are no
systems of regions of acceptance which generate confidence intervals whose
length is bounded, i.e. does not exceed a fixed number $M$.

As you see, the situation is unsatisfactory. It was in this unsatisfactory
state that it was faced by Stein, then in the United States Army. He had
been assigned to study some statistical problems connected with weather
forecasting and, hence, forced to learn some theory of statistics.

There were, among the things that Stein read, the now celebrated papers of
Abraham Wald, dealing with so-called “sequential analysis,” which, however,
seems to be more appropriately called “sequential procedures.” Generally,
a sequential procedure in testing a statistical hypothesis consists of a repeated
application of a triple rule: (a) to reject the hypothesis tested on data
available, (b) to accept it on the same data or (c) to make a specified number of
fresh observations. You begin by observing, say, $n_1$ random variables
$X_1, X_2, \ldots, X_{n_1}$. The totality of these observations is represented
by the sample point $E_1$ in the $n_1$ dimensioned space $W_1$. The space $W_1$
is divided into three parts, $W_1(a)$, $W_1(b)$, $W_1(c)$, and, following the
determination of $E_1$, the statistician takes action a, b or c according to
whether $E_1$ falls in $W_1(a)$, in $W_1(b)$ or in $W_1(c)$. In the latter case,
he makes $n_2$ fresh observations $X_{n_1+1}, X_{n_1+2}, \ldots, X_{n_1+n_2}$.
This number $n_2$ may be preassigned or, again, it may be a random variable, a
function of $E_1$. If $n_2$ is a fixed constant, then the $n_2$ new observations
combine with the original $n_1$ to determine a point, say $E_2$, in the
$n_1 + n_2$ dimensioned space $W_2$. If $n_2$ is a random variable capable of
assuming arbitrarily large values, then, in order to “accommodate” the sample
point $E_2$ it is necessary to consider the space of infinitely many dimensions.
This is also true in the frequent case where at every stage of sampling there
exist possible sample points at which the statistician will take more and more
observations.
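
The triple rule just described may be sketched in Python as a simple loop.
This is only an illustration under stated assumptions: the names `Action`,
`run_sequential`, `decide` and `toy_rule`, and the safety cap
`max_observations`, are hypothetical and are not taken from Wald’s papers. The
callable `decide` stands for the partition of the current sample space into
the parts (a), (b) and (c).

```python
from enum import Enum


class Action(Enum):
    REJECT = "a"      # reject the hypothesis on the data available
    ACCEPT = "b"      # accept it on the same data
    CONTINUE = "c"    # make a specified number of fresh observations


def run_sequential(draw, n1, decide, max_observations=10_000):
    """Apply the triple rule repeatedly until action (a) or (b) is taken.

    draw   -- callable returning one fresh observation
    n1     -- size of the first set of observations
    decide -- callable mapping the current sample to (Action, n_next);
              n_next is used only when the action is CONTINUE
    max_observations is an illustrative safety cap, not part of the theory.
    """
    sample = [draw() for _ in range(n1)]
    while True:
        action, n_next = decide(sample)
        if action is not Action.CONTINUE:
            return action, len(sample)
        if n_next < 1 or len(sample) + n_next > max_observations:
            raise RuntimeError("sampling rule asks for an unusable number of observations")
        sample.extend(draw() for _ in range(n_next))


def toy_rule(sample):
    """Hypothetical rule: decide on the sample mean, otherwise take two more."""
    mean = sum(sample) / len(sample)
    if mean > 1.0:
        return Action.REJECT, 0
    if mean < -1.0:
        return Action.ACCEPT, 0
    return Action.CONTINUE, 2


# A fixed stream of observations keeps the illustration deterministic.
observations = iter([0.0, 0.0, 5.0, 5.0, 5.0, 5.0])
result = run_sequential(lambda: next(observations), 2, toy_rule)
```

Here the number of fresh observations is returned by the rule itself, so that
it may depend on the sample already observed, as in case (c) above.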

Early writings of Wald and of his colleagues were mostly concerned with
sequential sampling procedures of testing a simple hypothesis against a single
simple alternative. However, these writings inspired Stein with the idea that
spaces with infinitely many dimensions are somewhat “wider” than spaces
of a finite number of dimensions. Hence, if in the spaces of finitely many
dimensions there are no desirable critical regions for testing Student’s
hypothesis nor desirable regions of acceptance for estimating the mean of a
normal distribution, such regions may exist in the space of infinitely many
dimensions. After some effort, Stein invented a two-step sequential procedure
which proved that his presumption was correct.

Using this procedure we obtain a test of Student’s hypothesis corresponding
to a preassigned level of significance $\varepsilon$, with a power function
independent of $\sigma$ and tending to unity as the “error” of the hypothesis
tested is increased. Moreover, for any given $\xi \ne \xi_0$ it is possible to
arrange that the power function at the point $\xi$ be equal to a preassigned
value $\beta > 0$, as close to unity as desired.

The same sequential procedure leads to confidence intervals for the estimated
$\xi$ which both correspond to a preassigned confidence coefficient $\alpha$ and
have a preassigned length $2\Delta$. The originality of the idea and the
elegance of the solution are above all praise.

I shall begin by explaining Stein’s procedure of obtaining the confidence 
interval. Next I shall show that it has the properties indicated. Thereafter 
the procedure of testing Student’s hypothesis will be more or less evident. 

Let $\alpha$ and $2\Delta$ denote, respectively, the preassigned confidence
coefficient and the preassigned length of the confidence interval. Stein’s
procedure consists of making two sets of observations. The first set of an
arbitrary number $n_1 \ge 2$ of observations

$$X_1, X_2, \ldots, X_{n_1}$$

is obtained and certain calculations are made. The result of these calculations
determines the number $n_2 \ge 1$ of observations

$$X_{n_1+1}, X_{n_1+2}, \ldots, X_{n_1+n_2}$$

of the second set. Then the two sets are combined to determine the confidence
interval for the unknown mean $\xi$ of the normal distribution sampled.

The calculations relating to the first set of observations, leading to the
value of $n_2$, are as follows. Denote by $\tau(\alpha)$ the value of Fisher’s
$t$ corresponding to the number of degrees of freedom $n_1 - 1$ and to the
confidence coefficient $\alpha$. In other words, $\tau(\alpha)$ is the root of
the equation

$$\int_{-\tau(\alpha)}^{\tau(\alpha)} \frac{dt}{\left(1 + \dfrac{t^2}{n_1 - 1}\right)^{n_1/2}}
  = \alpha \int_{-\infty}^{+\infty} \frac{dt}{\left(1 + \dfrac{t^2}{n_1 - 1}\right)^{n_1/2}}.$$

Having read $\tau(\alpha)$ from Fisher’s tables, we compute the expression

$$\theta = \left(\frac{\tau(\alpha)\,S}{\Delta}\right)^2, \tag{10}$$

where $S^2$ is the estimate of variance computed from the first set of
observations

$$S^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (X_i - \bar{X}_1)^2$$

with

$$\bar{X}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} X_i.$$

Now we determine $n_2$. If $\theta$ is less than $n_1$, then we put $n_2 = 1$.
Otherwise, if $\theta \ge n_1$, then $n_2$ is given the value equal to the least
integer which exceeds $\theta - n_1$. It will be seen that in either case

$$n_1 + n_2 > \theta. \tag{11}$$

We shall need this inequality at a later stage. It will be seen that the
greater $S$, the greater the value of $n_2$.
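
The rule for the size of the second set may be sketched in Python as follows.
This is a minimal illustration, assuming that the value of Fisher’s $t$ is
supplied by the user from tables; the function name `second_sample_size` and
the numerical sample are hypothetical. The value 2.776 is the usual tabled
value for 4 degrees of freedom and confidence coefficient 0.95.

```python
import math
import statistics


def second_sample_size(first_sample, tau, delta):
    """Return (theta, n2) computed from the first set of observations.

    first_sample -- the n1 >= 2 observations of the first set
    tau          -- tabled value of Fisher's t for n1 - 1 degrees of freedom
    delta        -- preassigned half-length of the confidence interval
    """
    n1 = len(first_sample)
    if n1 < 2:
        raise ValueError("the first set must contain at least two observations")
    s2 = statistics.variance(first_sample)      # S^2, divisor n1 - 1
    theta = tau ** 2 * s2 / delta ** 2          # theta = (tau * S / delta)^2
    if theta < n1:
        n2 = 1
    else:
        n2 = math.floor(theta - n1) + 1         # least integer exceeding theta - n1
    return theta, n2


first = [1.0, 2.0, 3.0, 4.0, 5.0]               # hypothetical first set, S^2 = 2.5
theta, n2 = second_sample_size(first, tau=2.776, delta=1.0)
theta_wide, n2_wide = second_sample_size(first, tau=2.776, delta=10.0)
```

With the narrow interval the second set must be fairly large, while with the
wide interval a single additional observation suffices; in both cases the sum
of the two sample sizes exceeds $\theta$.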

When the value of $n_2$ is determined, we make the second set of observations
and compute the corresponding mean, say

$$\bar{X}_2 = \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} X_i.$$

Stein’s confidence interval for $\xi$ is then given by the following formula

$$a\bar{X}_1 + (1-a)\bar{X}_2 - \Delta \le \xi \le a\bar{X}_1 + (1-a)\bar{X}_2 + \Delta, \tag{12}$$

where the value of $a$ is obtained from the equation

$$\frac{a^2}{n_1} + \frac{(1-a)^2}{n_2} = \frac{1}{\theta}. \tag{13}$$


You will observe that the difference between the extreme parts of (12) is
equal to $2\Delta$. Thus, the only thing which requires proof is that, given
that the mean of the sampled normal distribution is $\xi$, the probability

$$P\{a\bar{X}_1 + (1-a)\bar{X}_2 - \Delta \le \xi \le a\bar{X}_1 + (1-a)\bar{X}_2 + \Delta \mid \xi, \sigma\} = \alpha.$$

This identity can be rewritten as

$$P\{|a\bar{X}_1 + (1-a)\bar{X}_2 - \xi| \le \Delta \mid \xi, \sigma\} = \alpha. \tag{14}$$


In order to prove (14) we first verify that, with the described selection of
the value of $n_2$, equation (13) has real roots. Upon multiplying by
$n_1 n_2 \theta$ and sorting out terms, this equation may be rewritten as

$$(n_1 + n_2)\theta a^2 - 2 n_1 \theta a + n_1(\theta - n_2) = 0$$

and the two roots are

$$a = \frac{n_1\theta \pm \sqrt{n_1 n_2 \theta\,(n_1 + n_2 - \theta)}}{(n_1 + n_2)\theta}.$$

Because of (11) the roots are real. Either one can be used in (12).
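
As a numerical check of the two roots, here is a short Python sketch. The
function name `stein_weights` and the numbers used (sample sizes 5 and 15 with
$\theta$ = 19.26544) are hypothetical and serve only as an illustration.

```python
import math


def stein_weights(n1, n2, theta):
    """Return both roots a of a^2/n1 + (1 - a)^2/n2 = 1/theta."""
    if theta <= 0 or n1 + n2 < theta:
        raise ValueError("need 0 < theta <= n1 + n2 for real roots")
    root = math.sqrt(n1 * n2 * theta * (n1 + n2 - theta))
    denominator = (n1 + n2) * theta
    return (n1 * theta - root) / denominator, (n1 * theta + root) / denominator


n1, n2, theta = 5, 15, 19.26544
roots = stein_weights(n1, n2, theta)
# Substitute each root back into the left hand side of the equation for a.
residuals = [a * a / n1 + (1 - a) ** 2 / n2 - 1 / theta for a in roots]
```

Both residuals vanish up to rounding error, which confirms that either root
satisfies the equation from which $a$ is obtained.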

The proof of (14) is based on the following elements. The mean $\bar{X}_1$ is
independent of $S$ and is normally distributed about $\xi$ with variance
$\sigma^2/n_1$. The number $n_2$ is a single valued function of $S$, and hence a
random variable. So also is $a$. The mean $\bar{X}_2$ depends on $S$ only
through the number $n_2$ of observations on which it is based. Thus, for a
given $S$, the conditional distribution of, say,

$$\bar{X} = a\bar{X}_1 + (1-a)\bar{X}_2$$

is normal, with expectation

$$E(\bar{X}) = a\xi + (1-a)\xi = \xi$$

and variance

$$\sigma_{\bar{X}}^2 = \left(\frac{a^2}{n_1} + \frac{(1-a)^2}{n_2}\right)\sigma^2.$$

It follows that the conditional distribution, given $S$, of

$$\frac{a\bar{X}_1 + (1-a)\bar{X}_2 - \xi}{\sqrt{\dfrac{a^2}{n_1} + \dfrac{(1-a)^2}{n_2}}}$$

is normal about zero with variance $\sigma^2$. A further consequence is that the
absolute distribution of the quotient

$$\frac{a\bar{X}_1 + (1-a)\bar{X}_2 - \xi}{S\sqrt{\dfrac{a^2}{n_1} + \dfrac{(1-a)^2}{n_2}}}$$

is Student’s distribution with $n_1 - 1$ degrees of freedom. Thus recalling
the definition of $\tau(\alpha)$, we have

$$P\left\{ \frac{|a\bar{X}_1 + (1-a)\bar{X}_2 - \xi|}{S\sqrt{\dfrac{a^2}{n_1} + \dfrac{(1-a)^2}{n_2}}} \le \tau(\alpha) \,\middle|\, \xi, \sigma \right\} = \alpha. \tag{15}$$

However, because of (13) and (10) we have

$$\frac{a^2}{n_1} + \frac{(1-a)^2}{n_2} = \frac{\Delta^2}{S^2\,\tau^2(\alpha)}$$

and it is seen that (15) coincides with (14). This completes the proof of the
assertion that (12) represents a confidence interval corresponding to the
confidence coefficient $\alpha$.
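
The assertion just proved can be checked empirically. The following Python
sketch simulates the two-step procedure many times and counts how often the
interval covers the true mean. It is an illustration only: the function names
`stein_interval` and `coverage`, the seed and all numerical settings are
hypothetical, and the value 2.262 is the usual tabled value of Fisher’s $t$
for 9 degrees of freedom and confidence coefficient 0.95.

```python
import math
import random
import statistics


def stein_interval(draw, n1, tau, delta):
    """One run of the two-step procedure; returns (lower, upper, n2)."""
    first = [draw() for _ in range(n1)]
    s2 = statistics.variance(first)              # S^2 with divisor n1 - 1
    theta = tau ** 2 * s2 / delta ** 2           # theta = (tau * S / delta)^2
    n2 = 1 if theta < n1 else math.floor(theta - n1) + 1
    second = [draw() for _ in range(n2)]
    # Larger root of the quadratic for a; either root could be used.
    root = math.sqrt(n1 * n2 * theta * (n1 + n2 - theta))
    a = (n1 * theta + root) / ((n1 + n2) * theta)
    centre = a * statistics.fmean(first) + (1 - a) * statistics.fmean(second)
    return centre - delta, centre + delta, n2


def coverage(xi, sigma, n1, tau, delta, trials, seed=0):
    """Fraction of simulated intervals that cover the true mean xi."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lower, upper, _ = stein_interval(lambda: rng.gauss(xi, sigma), n1, tau, delta)
        hits += lower <= xi <= upper
    return hits / trials


small_sigma = coverage(xi=3.0, sigma=0.5, n1=10, tau=2.262, delta=1.0, trials=4000)
large_sigma = coverage(xi=3.0, sigma=2.0, n1=10, tau=2.262, delta=1.0, trials=4000)
```

Both frequencies should lie close to 0.95 although the interval has the same
length 2 in the two cases; what changes with the standard deviation is the
number of observations in the second set.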

As to Stein’s test of Student’s hypothesis, formula (14) is very suggestive.
As you must have guessed, if $\xi_0$ is the value of the mean specified by the
hypothesis, the test criterion is, say

$$Y = a\bar{X}_1 + (1-a)\bar{X}_2 - \xi_0,$$

and the symmetric test consists in the rule of rejecting the hypothesis
whenever $|Y|$ exceeds $\Delta$. According to formula (14), this test
corresponds to the level of significance $\varepsilon = 1 - \alpha$.

The test just described contains two arbitrary elements. One of them is
the level of significance $\varepsilon$ and its choice must be governed by
considerations of the importance of avoiding errors of the first kind. The
choice of $\varepsilon$ determines $\alpha$ and hence $\tau(\alpha)$. The second
arbitrary element in the procedure is $\Delta$. In the theory of Stein’s
confidence interval, $\Delta$ plays an independent role. It represents one-half
of the length of the confidence interval and is selected as such. I shall now
show that in the theory of Stein’s test of Student’s hypothesis, the
arbitrariness of $\Delta$ may be used to insure that for a given value of the
difference $|\xi_0 - \xi|$ the power function of the test has a preassigned
value $\beta$.

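Denote by $\beta(\xi \mid \Delta)$ the power function of Stein’s test. If $\xi$
stands for the true value of the mean, we have

$$\beta(\xi \mid \Delta) = 1 - P\{|Y| \le \Delta \mid \xi, \sigma\}.$$

The value of the probability in the right hand side is easily computed by
noticing that

$$Y = (a\bar{X}_1 + (1-a)\bar{X}_2 - \xi) + (\xi - \xi_0)$$

and by recalling that

$$\frac{a\bar{X}_1 + (1-a)\bar{X}_2 - \xi}{S\sqrt{\dfrac{a^2}{n_1} + \dfrac{(1-a)^2}{n_2}}}
  = \frac{\tau(\alpha)}{\Delta}\,(a\bar{X}_1 + (1-a)\bar{X}_2 - \xi)$$

follows the Student’s distribution with $n_1 - 1$ degrees of freedom. Easy
algebra gives

$$P\{|Y| \le \Delta \mid \xi, \sigma\} =
  P\left\{ b(\xi, \Delta) - \tau(\alpha) \le \frac{\tau(\alpha)}{\Delta}\,(a\bar{X}_1 + (1-a)\bar{X}_2 - \xi) \le b(\xi, \Delta) + \tau(\alpha) \,\middle|\, \xi, \sigma \right\}, \tag{16}$$

where, for the sake of brevity,

$$b(\xi, \Delta) = \frac{\tau(\alpha)\,(\xi_0 - \xi)}{\Delta}.$$

It follows that $P\{|Y| \le \Delta \mid \xi, \sigma\}$ is equal to the integral
of Student’s probability density function with $n_1 - 1$ degrees of freedom
taken over the interval of length $2\tau(\alpha)$ centered at $b(\xi, \Delta)$.
Therefore the power function is

$$\beta(\xi \mid \Delta) = 1 -
  \frac{\displaystyle\int_{b(\xi,\Delta) - \tau(\alpha)}^{b(\xi,\Delta) + \tau(\alpha)} \frac{dt}{\left(1 + \dfrac{t^2}{n_1 - 1}\right)^{n_1/2}}}
       {\displaystyle\int_{-\infty}^{+\infty} \frac{dt}{\left(1 + \dfrac{t^2}{n_1 - 1}\right)^{n_1/2}}}.$$

Obviously, for fixed $\tau(\alpha)$ and $\xi \ne \xi_0$, the value of
$|b(\xi, \Delta)|$ is close to zero when $\Delta$ is large and goes to infinity
as $\Delta$ decreases. At the same time the value of $\beta(\xi \mid \Delta)$
varies continuously from $\varepsilon$ to unity. Thus, for given values
$\varepsilon$ and $\xi$, the value of $\Delta$ can be adjusted so that
$\beta(\xi \mid \Delta) = \beta$. Q.E.D.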
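
The dependence of the power on the half-length of the interval may be
illustrated numerically. The Python sketch below integrates Student’s density
with a composite Simpson rule; the function names `student_density`,
`integrate` and `stein_power`, the centre of the interval taken as
$\tau(\alpha)(\xi_0 - \xi)/\Delta$, and all numerical settings are assumptions
of this illustration. The value 2.262 is the usual tabled value of Fisher’s
$t$ for 9 degrees of freedom and confidence coefficient 0.95.

```python
import math


def student_density(t, m):
    """Probability density of Student's distribution with m degrees of freedom."""
    log_const = math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
    const = math.exp(log_const) / math.sqrt(m * math.pi)
    return const * (1 + t * t / m) ** (-(m + 1) / 2)


def integrate(f, lower, upper, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (upper - lower) / steps
    total = f(lower) + f(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lower + k * h)
    return total * h / 3


def stein_power(xi, xi0, n1, tau, delta):
    """Probability of rejecting xi0 when the true mean is xi."""
    b = tau * (xi0 - xi) / delta                 # centre of the interval of length 2 * tau
    inside = integrate(lambda t: student_density(t, n1 - 1), b - tau, b + tau)
    return 1 - inside


significance = stein_power(xi=0.0, xi0=0.0, n1=10, tau=2.262, delta=1.0)
wide = stein_power(xi=1.0, xi0=0.0, n1=10, tau=2.262, delta=2.0)
narrow = stein_power(xi=1.0, xi0=0.0, n1=10, tau=2.262, delta=0.5)
```

When the hypothesis is true the computed value is close to 0.05, and for a
fixed error in the hypothesis the power grows as the half-length is reduced,
in agreement with the argument above.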

Casual inspection of formula (16) may cause a sensation of surprise at the
conclusion just reached. However, this sensation disappears when one recalls
that a change in the value of $\Delta$ makes a change in the value of

$$\theta = \left(\frac{\tau(\alpha)\,S}{\Delta}\right)^2$$

and this, in turn, influences the number $n_2$ of observations of the second set
on which the mean $\bar{X}_2$ is calculated. The greater the desired power
corresponding to a given $\xi$, the smaller must be $\Delta$, the larger
$\theta$ corresponding to an observed $S$, and the larger $n_2$. Thus, with
Stein’s procedure, we can preassign both the level of significance
$\varepsilon = 1 - \alpha$ and the power corresponding to a chosen size of error
in the hypothesis, $|\xi_0 - \xi|$. However, the more exigent we are in either
respect, the more observations will be needed to achieve the desired goal.

The above account of Stein’s work does not cover all of his results and, 
if you study his paper, you will find it interesting and informative. Among 
other things you will find in it the description of another procedure, slightly 
more efficient than that described, and a generalization of these results to 
the case of the general linear hypothesis. 

The most essential advance achieved, as I presented it, consists in the
shift from studies of sample spaces having finitely many dimensions to
studies of the sample space of infinitely many dimensions, and the proof
that in the latter there are various possibilities which are not available in
the former. Stein proved this point by giving an ingenious example. Now
the door is open for a search for an optimum sequential procedure. This
problem appears rather difficult, but Stein has already obtained some
relevant results which will soon appear in print.

I began this lecture with a reference to some correspondence with Joseph 
Berkson in which he complained that, since the original confidence interval 
(1) for estimating the mean of a normal distribution is unbounded and, 
from time to time, must be inordinately long, a consistent use of this interval 
in practical work is impossible. 

Now, this difficulty seems to have been removed by means of Stein’s work.
Brilliant as his result is, we must realize that its practical applications
involve a new difficulty, just as insuperable as that complained of by
Berkson. This difficulty is connected with the fact that in the course of
repeated attempts to apply Stein’s procedure, the observed value of $S$ will be
exceedingly large from time to time and will determine a correspondingly large
value of $n_2$. Likewise, it is obvious that, if the M.D., of whom I spoke at
the beginning of this lecture, is advised to make an additional
$n_2$ = 1,000,000 determinations of sugar in a patient’s blood, he will refuse.
Thus, in all practical work it will be unavoidable to apply some sort of
“curtailed” Stein procedure. However, this conclusion need not inspire us with
undue pessimism. A strict accordance between practical work and a corresponding
theory is never possible and yet all our life is based on constant practical
applications of inapplicable theories. For example, we postulate that the
M.D.’s analyses follow a normal law of frequency whereas it is quite plain
that none of his determinations can be negative and none can exceed 100.
By assuming normality we substitute “improbability” instead of
“impossibility” and are content. So does Berkson. If he now complains of the
possibility of tremendous values of $n_2$, we may point out that such values
are extremely improbable and may advise him to be satisfied by making a
reverse substitution of “impossibility” instead of “improbability.”


Now, I wish to add a little postscript to the above lecture on Stein’s
results and to the whole collection of lectures and conferences assembled
in this book. This postscript has to deal with the general character of
statistical research and with the ties that exist between the pure
mathematical theory of statistics and the applied work. I deeply regret the not
infrequent emphatic declarations for or against pure theory and for or
against work in applications.⁷ It is my strong belief that both are important
and, certainly, both are interesting. The Berkson-Dantzig-Stein incident
just recounted provides an excellent illustration of the view that, thus far,
mathematical statistics is still in its early phase of development and that
the various fields of applied statistical work constitute the source of
interesting problems of theory. The results of Dantzig and Stein are certainly
contributions to pure theory of statistics. Yet, whether the two authors
are aware of the fact or not, the theoretical problems they solved originated
from difficulties in applied work. Further development of mathematical
statistics, and also the success of university instruction of statistics,
depend upon maintaining close contact and a harmonious balance between
mathematical direction of thought and the various fields of application.


7 Quite recently I was shown some letters regarding myself. One very nice person 
wrote “I met Neyman. In general he is O.K., but hopelessly mathematical... .” The 
letter of another equally nice person stated: “Once upon a time Neyman did some real 
work. Now, however, he is interested in applications.” 